From 75182c755e95a4e6e283e7d8064398d8550a2172 Mon Sep 17 00:00:00 2001 From: AlongWY Date: Thu, 19 Oct 2023 05:21:27 +0000 Subject: [PATCH] deploy: 72066be21ad467c8ffc76b74c152b38decf3f0ac --- .nojekyll | 0 cache.json | 1 + favicon.ico | Bin 0 -> 15086 bytes index.css | 355 + index.html | 82160 ++++++++++++++++++++++++++++++++++++++++++++++++++ index.js | 39 + 6 files changed, 82555 insertions(+) create mode 100644 .nojekyll create mode 100644 cache.json create mode 100644 favicon.ico create mode 100644 index.css create mode 100644 index.html create mode 100644 index.js diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..61e54dfe --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2023-10-11T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2305.05658v2","updated":"2023-10-11T17:59:44Z","published":"2023-05-09T17:52:59Z","title":"TidyBot: Personalized Robot Assistance with Large Language Models","summary":" For a robot to personalize physical assistance effectively, it must learn\nuser preferences that can be generally reapplied to future scenarios. In this\nwork, we investigate personalization of household cleanup with robots that can\ntidy up rooms by picking up objects and putting them away. A key challenge is\ndetermining the proper place to put each object, as people's preferences can\nvary greatly depending on personal taste or cultural background. For instance,\none person may prefer storing shirts in the drawer, while another may prefer\nthem on the shelf. We aim to build systems that can learn such preferences from\njust a handful of examples via prior interactions with a particular person. We\nshow that robots can combine language-based planning and perception with the\nfew-shot summarization capabilities of large language models (LLMs) to infer\ngeneralized user preferences that are broadly applicable to future\ninteractions. This approach enables fast adaptation and achieves 91.2% accuracy\non unseen objects in our benchmark dataset. We also demonstrate our approach on\na real-world mobile manipulator called TidyBot, which successfully puts away\n85.0% of objects in real-world test scenarios.\n","authors":["Jimmy Wu","Rika Antonova","Adam Kan","Marion Lepert","Andy Zeng","Shuran Song","Jeannette Bohg","Szymon Rusinkiewicz","Thomas Funkhouser"],"pdf_url":"https://arxiv.org/pdf/2305.05658v2.pdf","comment":"Accepted to Autonomous Robots (AuRo) - Special Issue: Large Language\n Models in Robotics, 2023 and IEEE/RSJ International Conference on Intelligent\n Robots and Systems (IROS), 2023. Project page:\n https://tidybot.cs.princeton.edu"},{"id":"http://arxiv.org/abs/2310.07715v1","updated":"2023-10-11T17:59:36Z","published":"2023-10-11T17:59:36Z","title":"To Build Our Future, We Must Know Our Past: Contextualizing Paradigm\n Shifts in Natural Language Processing","summary":" NLP is in a period of disruptive change that is impacting our methodologies,\nfunding sources, and public perception. In this work, we seek to understand how\nto shape our future by better understanding our past. We study factors that\nshape NLP as a field, including culture, incentives, and infrastructure by\nconducting long-form interviews with 26 NLP researchers of varying seniority,\nresearch area, institution, and social identity. 
Our interviewees identify\ncyclical patterns in the field, as well as new shifts without historical\nparallel, including changes in benchmark culture and software infrastructure.\nWe complement this discussion with quantitative analysis of citation,\nauthorship, and language use in the ACL Anthology over time. We conclude by\ndiscussing shared visions, concerns, and hopes for the future of NLP. We hope\nthat this study of our field's past and present can prompt informed discussion\nof our community's implicit norms and more deliberate action to consciously\nshape the future.\n","authors":["Sireesh Gururaja","Amanda Bertsch","Clara Na","David Gray Widder","Emma Strubell"],"pdf_url":"https://arxiv.org/pdf/2310.07715v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07713v1","updated":"2023-10-11T17:59:05Z","published":"2023-10-11T17:59:05Z","title":"InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining","summary":" Pretraining auto-regressive large language models (LLMs) with retrieval\ndemonstrates better perplexity and factual accuracy by leveraging external\ndatabases. However, the size of existing pretrained retrieval-augmented LLM is\nstill limited (e.g., Retro has 7.5B parameters), which limits the effectiveness\nof instruction tuning and zero-shot generalization. In this work, we introduce\nRetro 48B, the largest LLM pretrained with retrieval before instruction tuning.\nSpecifically, we continue to pretrain the 43B GPT model on additional 100\nbillion tokens using the Retro augmentation method by retrieving from 1.2\ntrillion tokens. The obtained foundation model, Retro 48B, largely outperforms\nthe original 43B GPT in terms of perplexity. After instruction tuning on Retro,\nInstructRetro demonstrates significant improvement over the instruction tuned\nGPT on zero-shot question answering (QA) tasks. Specifically, the average\nimprovement of InstructRetro is 7% over its GPT counterpart across 8 short-form\nQA tasks, and 10% over GPT across 4 challenging long-form QA tasks.\nSurprisingly, we find that one can ablate the encoder from InstructRetro\narchitecture and directly use its decoder backbone, while achieving comparable\nresults. We hypothesize that pretraining with retrieval makes its decoder good\nat incorporating context for QA. Our results highlights the promising direction\nto obtain a better GPT decoder for QA through continued pretraining with\nretrieval before instruction tuning.\n","authors":["Boxin Wang","Wei Ping","Lawrence McAfee","Peng Xu","Bo Li","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2310.07713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07712v1","updated":"2023-10-11T17:59:02Z","published":"2023-10-11T17:59:02Z","title":"Found in the Middle: Permutation Self-Consistency Improves Listwise\n Ranking in Large Language Models","summary":" Large language models (LLMs) exhibit positional bias in how they use context,\nwhich especially complicates listwise ranking. To address this, we propose\npermutation self-consistency, a form of self-consistency over ranking list\noutputs of black-box LLMs. Our key idea is to marginalize out different list\norders in the prompt to produce an order-independent ranking with less\npositional bias. First, given some input prompt, we repeatedly shuffle the list\nin the prompt and pass it through the LLM while holding the instructions the\nsame. 
Next, we aggregate the resulting sample of rankings by computing the\ncentral ranking closest in distance to all of them, marginalizing out prompt\norder biases in the process. Theoretically, we prove the robustness of our\nmethod, showing convergence to the true ranking in the presence of random\nperturbations. Empirically, on five list-ranking datasets in sorting and\npassage reranking, our approach improves scores from conventional inference by\nup to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B), surpassing the previous\nstate of the art in passage reranking. Our code is at\nhttps://github.com/castorini/perm-sc.\n","authors":["Raphael Tang","Xinyu Zhang","Xueguang Ma","Jimmy Lin","Ferhan Ture"],"pdf_url":"https://arxiv.org/pdf/2310.07712v1.pdf","comment":"First two authors contributed equally; 10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.07710v1","updated":"2023-10-11T17:57:35Z","published":"2023-10-11T17:57:35Z","title":"DiPmark: A Stealthy, Efficient and Resilient Watermark for Large\n Language Models","summary":" Watermarking techniques offer a promising way to secure data via embedding\ncovert information into the data. A paramount challenge in the domain lies in\npreserving the distribution of original data during watermarking. Our research\nextends and refines existing watermarking framework, placing emphasis on the\nimportance of a distribution-preserving (DiP) watermark. Contrary to the\ncurrent strategies, our proposed DiPmark preserves the original token\ndistribution during watermarking (stealthy), is detectable without access to\nthe language model API or weights (efficient), and is robust to moderate\nchanges of tokens (resilient). This is achieved by incorporating a novel\nreweight strategy, combined with a hash function that assigns unique\n\\textit{i.i.d.} ciphers based on the context. The empirical benchmarks of our\napproach underscore its stealthiness, efficiency, and resilience, making it a\nrobust solution for watermarking tasks that demand impeccable quality\npreservation.\n","authors":["Yihan Wu","Zhengmian Hu","Hongyang Zhang","Heng Huang"],"pdf_url":"https://arxiv.org/pdf/2310.07710v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07707v1","updated":"2023-10-11T17:57:14Z","published":"2023-10-11T17:57:14Z","title":"MatFormer: Nested Transformer for Elastic Inference","summary":" Transformer models are deployed in a wide range of settings, from\nmulti-accelerator clusters to standalone mobile phones. The diverse inference\nconstraints in these scenarios necessitate practitioners to train foundation\nmodels such as PaLM 2, Llama, & ViTs as a series of models of varying sizes.\nDue to significant training costs, only a select few model sizes are trained\nand supported, limiting more fine-grained control over relevant tradeoffs,\nincluding latency, cost, and accuracy. This work introduces MatFormer, a nested\nTransformer architecture designed to offer elasticity in a variety of\ndeployment constraints. Each Feed Forward Network (FFN) block of a MatFormer\nmodel is jointly optimized with a few nested smaller FFN blocks. This training\nprocedure allows for the Mix'n'Match of model granularities across layers --\ni.e., a trained universal MatFormer model enables extraction of hundreds of\naccurate smaller models, which were never explicitly optimized. We empirically\ndemonstrate MatFormer's effectiveness across different model classes (decoders\n& encoders), modalities (language & vision), and scales (up to 2.6B\nparameters). 
We find that a 2.6B decoder-only MatFormer language model (MatLM)\nallows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting\ncomparable validation loss and one-shot downstream evaluations to their\nindependently trained counterparts. Furthermore, we observe that smaller\nencoders extracted from a universal MatFormer-based ViT (MatViT) encoder\npreserve the metric-space structure for adaptive large-scale retrieval.\nFinally, we showcase that speculative decoding with the accurate and consistent\nsubmodels extracted from MatFormer can further reduce inference latency.\n","authors":[" Devvrit","Sneha Kudugunta","Aditya Kusupati","Tim Dettmers","Kaifeng Chen","Inderjit Dhillon","Yulia Tsvetkov","Hannaneh Hajishirzi","Sham Kakade","Ali Farhadi","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2310.07707v1.pdf","comment":"31 pages, 12 figures, first three authors contributed equally"},{"id":"http://arxiv.org/abs/2310.07704v1","updated":"2023-10-11T17:55:15Z","published":"2023-10-11T17:55:15Z","title":"Ferret: Refer and Ground Anything Anywhere at Any Granularity","summary":" We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of\nunderstanding spatial referring of any shape or granularity within an image and\naccurately grounding open-vocabulary descriptions. To unify referring and\ngrounding in the LLM paradigm, Ferret employs a novel and powerful hybrid\nregion representation that integrates discrete coordinates and continuous\nfeatures jointly to represent a region in the image. To extract the continuous\nfeatures of versatile regions, we propose a spatial-aware visual sampler, adept\nat handling varying sparsity across different shapes. Consequently, Ferret can\naccept diverse region inputs, such as points, bounding boxes, and free-form\nshapes. To bolster the desired capability of Ferret, we curate GRIT, a\ncomprehensive refer-and-ground instruction tuning dataset including 1.1M\nsamples that contain rich hierarchical spatial knowledge, with 95K hard\nnegative data to promote model robustness. The resulting model not only\nachieves superior performance in classical referring and grounding tasks, but\nalso greatly outperforms existing MLLMs in region-based and\nlocalization-demanded multimodal chatting. Our evaluations also reveal a\nsignificantly improved capability of describing image details and a remarkable\nalleviation in object hallucination. Code and data will be available at\nhttps://github.com/apple/ml-ferret\n","authors":["Haoxuan You","Haotian Zhang","Zhe Gan","Xianzhi Du","Bowen Zhang","Zirui Wang","Liangliang Cao","Shih-Fu Chang","Yinfei Yang"],"pdf_url":"https://arxiv.org/pdf/2310.07704v1.pdf","comment":"30 pages, 10 figures. Code/Project Website:\n https://github.com/apple/ml-ferret"},{"id":"http://arxiv.org/abs/2310.07700v1","updated":"2023-10-11T17:51:28Z","published":"2023-10-11T17:51:28Z","title":"Knowledge-enhanced Memory Model for Emotional Support Conversation","summary":" The prevalence of mental disorders has become a significant issue, leading to\nthe increased focus on Emotional Support Conversation as an effective\nsupplement for mental health support. Existing methods have achieved compelling\nresults, however, they still face three challenges: 1) variability of emotions,\n2) practicality of the response, and 3) intricate strategy modeling. To address\nthese challenges, we propose a novel knowledge-enhanced Memory mODEl for\nemotional suppoRt coNversation (MODERN). 
Specifically, we first devise a\nknowledge-enriched dialogue context encoding to perceive the dynamic emotion\nchange of different periods of the conversation for coherent user state\nmodeling and select context-related concepts from ConceptNet for practical\nresponse generation. Thereafter, we implement a novel memory-enhanced strategy\nmodeling module to model the semantic patterns behind the strategy categories.\nExtensive experiments on a widely used large-scale dataset verify the\nsuperiority of our model over cutting-edge baselines.\n","authors":["Mengzhao Jia","Qianglong Chen","Liqiang Jing","Dawei Fu","Renyu Li"],"pdf_url":"https://arxiv.org/pdf/2310.07700v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.08896v3","updated":"2023-10-11T17:43:28Z","published":"2023-03-15T19:31:21Z","title":"SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for\n Generative Large Language Models","summary":" Generative Large Language Models (LLMs) such as GPT-3 are capable of\ngenerating highly fluent responses to a wide variety of user prompts. However,\nLLMs are known to hallucinate facts and make non-factual statements which can\nundermine trust in their output. Existing fact-checking approaches either\nrequire access to the output probability distribution (which may not be\navailable for systems such as ChatGPT) or external databases that are\ninterfaced via separate, often complex, modules. In this work, we propose\n\"SelfCheckGPT\", a simple sampling-based approach that can be used to fact-check\nthe responses of black-box models in a zero-resource fashion, i.e. without an\nexternal database. SelfCheckGPT leverages the simple idea that if an LLM has\nknowledge of a given concept, sampled responses are likely to be similar and\ncontain consistent facts. However, for hallucinated facts, stochastically\nsampled responses are likely to diverge and contradict one another. We\ninvestigate this approach by using GPT-3 to generate passages about individuals\nfrom the WikiBio dataset, and manually annotate the factuality of the generated\npassages. We demonstrate that SelfCheckGPT can: i) detect non-factual and\nfactual sentences; and ii) rank passages in terms of factuality. We compare our\napproach to several baselines and show that our approach has considerably\nhigher AUC-PR scores in sentence-level hallucination detection and higher\ncorrelation scores in passage-level factuality assessment compared to grey-box\nmethods.\n","authors":["Potsawee Manakul","Adian Liusie","Mark J. F. Gales"],"pdf_url":"https://arxiv.org/pdf/2303.08896v3.pdf","comment":"EMNLP 2023 (main conference)"},{"id":"http://arxiv.org/abs/2310.04381v2","updated":"2023-10-11T17:36:12Z","published":"2023-10-06T17:19:40Z","title":"Hermes: Unlocking Security Analysis of Cellular Network Protocols by\n Synthesizing Finite State Machines from Natural Language Specifications","summary":" In this paper, we present Hermes, an end-to-end framework to automatically\ngenerate formal representations from natural language cellular specifications.\nWe first develop a neural constituency parser, NEUTREX, to process\ntransition-relevant texts and extract transition components (i.e., states,\nconditions, and actions). We also design a domain-specific language to\ntranslate these transition components to logical formulas by leveraging\ndependency parse trees. Finally, we compile these logical formulas to generate\ntransitions and create the formal model as finite state machines. 
To\ndemonstrate the effectiveness of Hermes, we evaluate it on 4G NAS, 5G NAS, and\n5G RRC specifications and obtain an overall accuracy of 81-87%, which is a\nsubstantial improvement over the state-of-the-art. Our security analysis of the\nextracted models uncovers 3 new vulnerabilities and identifies 19 previous\nattacks in 4G and 5G specifications, and 7 deviations in commercial 4G\nbasebands.\n","authors":["Abdullah Al Ishtiaq","Sarkar Snigdha Sarathi Das","Syed Md Mukit Rashid","Ali Ranjbar","Kai Tu","Tianwei Wu","Zhezheng Song","Weixuan Wang","Mujtahid Akon","Rui Zhang","Syed Rafiul Hussain"],"pdf_url":"https://arxiv.org/pdf/2310.04381v2.pdf","comment":"Accepted at USENIX Security 24"},{"id":"http://arxiv.org/abs/2306.12424v2","updated":"2023-10-11T17:34:19Z","published":"2023-06-21T17:59:51Z","title":"VisoGender: A dataset for benchmarking gender bias in image-text pronoun\n resolution","summary":" We introduce VisoGender, a novel dataset for benchmarking gender bias in\nvision-language models. We focus on occupation-related biases within a\nhegemonic system of binary gender, inspired by Winograd and Winogender schemas,\nwhere each image is associated with a caption containing a pronoun relationship\nof subjects and objects in the scene. VisoGender is balanced by gender\nrepresentation in professional roles, supporting bias evaluation in two ways:\ni) resolution bias, where we evaluate the difference between pronoun resolution\naccuracies for image subjects with gender presentations perceived as masculine\nversus feminine by human annotators and ii) retrieval bias, where we compare\nratios of professionals perceived to have masculine and feminine gender\npresentations retrieved for a gender-neutral search query. We benchmark several\nstate-of-the-art vision-language models and find that they demonstrate bias in\nresolving binary gender in complex scenes. While the direction and magnitude of\ngender bias depends on the task and the model being evaluated, captioning\nmodels are generally less biased than Vision-Language Encoders. Dataset and\ncode are available at https://github.com/oxai/visogender\n","authors":["Siobhan Mackenzie Hall","Fernanda Gonçalves Abrantes","Hanwen Zhu","Grace Sodunke","Aleksandar Shtedritski","Hannah Rose Kirk"],"pdf_url":"https://arxiv.org/pdf/2306.12424v2.pdf","comment":"Data and code available at https://github.com/oxai/visogender"},{"id":"http://arxiv.org/abs/2310.07676v1","updated":"2023-10-11T17:21:03Z","published":"2023-10-11T17:21:03Z","title":"Composite Backdoor Attacks Against Large Language Models","summary":" Large language models (LLMs) have demonstrated superior performance compared\nto previous methods on various tasks, and often serve as the foundation models\nfor many researches and services. However, the untrustworthy third-party LLMs\nmay covertly introduce vulnerabilities for downstream tasks. In this paper, we\nexplore the vulnerability of LLMs through the lens of backdoor attacks.\nDifferent from existing backdoor attacks against LLMs, ours scatters multiple\ntrigger keys in different prompt components. Such a Composite Backdoor Attack\n(CBA) is shown to be stealthier than implanting the same multiple trigger keys\nin only a single component. CBA ensures that the backdoor is activated only\nwhen all trigger keys appear. Our experiments demonstrate that CBA is effective\nin both natural language processing (NLP) and multimodal tasks. 
For instance,\nwith $3\\%$ poisoning samples against the LLaMA-7B model on the Emotion dataset,\nour attack achieves a $100\\%$ Attack Success Rate (ASR) with a False Triggered\nRate (FTR) below $2.06\\%$ and negligible model accuracy degradation. The unique\ncharacteristics of our CBA can be tailored for various practical scenarios,\ne.g., targeting specific user groups. Our work highlights the necessity of\nincreased security research on the trustworthiness of foundation LLMs.\n","authors":["Hai Huang","Zhengyu Zhao","Michael Backes","Yun Shen","Yang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07676v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.08703v3","updated":"2023-10-11T17:00:34Z","published":"2023-05-15T15:06:20Z","title":"Schema-adaptable Knowledge Graph Construction","summary":" Conventional Knowledge Graph Construction (KGC) approaches typically follow\nthe static information extraction paradigm with a closed set of pre-defined\nschema. As a result, such approaches fall short when applied to dynamic\nscenarios or domains, whereas a new type of knowledge emerges. This\nnecessitates a system that can handle evolving schema automatically to extract\ninformation for KGC. To address this need, we propose a new task called\nschema-adaptable KGC, which aims to continually extract entity, relation, and\nevent based on a dynamically changing schema graph without re-training. We\nfirst split and convert existing datasets based on three principles to build a\nbenchmark, i.e., horizontal schema expansion, vertical schema expansion, and\nhybrid schema expansion; then investigate the schema-adaptable performance of\nseveral well-known approaches such as Text2Event, TANL, UIE and GPT-3.5. We\nfurther propose a simple yet effective baseline dubbed \\textsc{AdaKGC}, which\ncontains schema-enriched prefix instructor and schema-conditioned dynamic\ndecoding to better handle evolving schema. Comprehensive experimental results\nillustrate that AdaKGC can outperform baselines but still have room for\nimprovement. We hope the proposed work can deliver benefits to the community.\nCode and datasets available at https://github.com/zjunlp/AdaKGC.\n","authors":["Hongbin Ye","Honghao Gui","Xin Xu","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.08703v3.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.07659v1","updated":"2023-10-11T17:00:29Z","published":"2023-10-11T17:00:29Z","title":"Well Begun is Half Done: Generator-agnostic Knowledge Pre-Selection for\n Knowledge-Grounded Dialogue","summary":" Accurate knowledge selection is critical in knowledge-grounded dialogue\nsystems. Towards a closer look at it, we offer a novel perspective to organize\nexisting literature, i.e., knowledge selection coupled with, after, and before\ngeneration. We focus on the third under-explored category of study, which can\nnot only select knowledge accurately in advance, but has the advantage to\nreduce the learning, adjustment, and interpretation burden of subsequent\nresponse generation models, especially LLMs. 
We propose GATE, a\ngenerator-agnostic knowledge selection method, to prepare knowledge for\nsubsequent response generation models by selecting context-related knowledge\namong different knowledge structures and variable knowledge requirements.\nExperimental results demonstrate the superiority of GATE, and indicate that\nknowledge selection before generation is a lightweight yet effective way to\nfacilitate LLMs (e.g., ChatGPT) to generate more informative responses.\n","authors":["Qin Lang","Zhang Yao","Liang Hongru","Wang jun","Yang Zhenglu"],"pdf_url":"https://arxiv.org/pdf/2310.07659v1.pdf","comment":"Accepted by EMNLP2023 main conference"},{"id":"http://arxiv.org/abs/2310.07654v1","updated":"2023-10-11T16:54:57Z","published":"2023-10-11T16:54:57Z","title":"Audio-Visual Neural Syntax Acquisition","summary":" We study phrase structure induction from visually-grounded speech. The core\nidea is to first segment the speech waveform into sequences of word segments,\nand subsequently induce phrase structure using the inferred segment-level\ncontinuous representations. We present the Audio-Visual Neural Syntax Learner\n(AV-NSL) that learns phrase structure by listening to audio and looking at\nimages, without ever being exposed to text. By training on paired images and\nspoken captions, AV-NSL exhibits the capability to infer meaningful phrase\nstructures that are comparable to those derived by naturally-supervised text\nparsers, for both English and German. Our findings extend prior work in\nunsupervised language acquisition from speech and grounded grammar induction,\nand present one approach to bridge the gap between the two topics.\n","authors":["Cheng-I Jeff Lai","Freda Shi","Puyuan Peng","Yoon Kim","Kevin Gimpel","Shiyu Chang","Yung-Sung Chuang","Saurabhchand Bhati","David Cox","David Harwath","Yang Zhang","Karen Livescu","James Glass"],"pdf_url":"https://arxiv.org/pdf/2310.07654v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13172v2","updated":"2023-10-11T16:51:50Z","published":"2023-05-22T16:00:00Z","title":"Editing Large Language Models: Problems, Methods, and Opportunities","summary":" Despite the ability to train capable LLMs, the methodology for maintaining\ntheir relevancy and rectifying errors remains elusive. To this end, the past\nfew years have witnessed a surge in techniques for editing LLMs, the objective\nof which is to efficiently alter the behavior of LLMs within a specific domain\nwithout negatively impacting performance across other inputs. This paper\nembarks on a deep exploration of the problems, methods, and opportunities\nrelated to model editing for LLMs. In particular, we provide an exhaustive\noverview of the task definition and challenges associated with model editing,\nalong with an in-depth empirical analysis of the most progressive methods\ncurrently at our disposal. We also build a new benchmark dataset to facilitate\na more robust evaluation and pinpoint enduring issues intrinsic to existing\ntechniques. Our objective is to provide valuable insights into the\neffectiveness and feasibility of each editing technique, thereby assisting the\ncommunity in making informed decisions on the selection of the most appropriate\nmethod for a specific task or context. Code and datasets are available at\nhttps://github.com/zjunlp/EasyEdit.\n","authors":["Yunzhi Yao","Peng Wang","Bozhong Tian","Siyuan Cheng","Zhoubo Li","Shumin Deng","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.13172v2.pdf","comment":"EMNLP 2023. 
Updated with new experiments"},{"id":"http://arxiv.org/abs/2310.07652v1","updated":"2023-10-11T16:51:46Z","published":"2023-10-11T16:51:46Z","title":"LLM4Vis: Explainable Visualization Recommendation using ChatGPT","summary":" Data visualization is a powerful tool for exploring and communicating\ninsights in various domains. To automate visualization choice for datasets, a\ntask known as visualization recommendation has been proposed. Various\nmachine-learning-based approaches have been developed for this purpose, but\nthey often require a large corpus of dataset-visualization pairs for training\nand lack natural explanations for their results. To address this research gap,\nwe propose LLM4Vis, a novel ChatGPT-based prompting approach to perform\nvisualization recommendation and return human-like explanations using very few\ndemonstration examples. Our approach involves feature description,\ndemonstration example selection, explanation generation, demonstration example\nconstruction, and inference steps. To obtain demonstration examples with\nhigh-quality explanations, we propose a new explanation generation\nbootstrapping to iteratively refine generated explanations by considering the\nprevious generation and template-based hint. Evaluations on the VizML dataset\nshow that LLM4Vis outperforms or performs similarly to supervised learning\nmodels like Random Forest, Decision Tree, and MLP in both few-shot and\nzero-shot settings. The qualitative evaluation also shows the effectiveness of\nexplanations generated by LLM4Vis. We make our code publicly available at\n\\href{https://github.com/demoleiwang/LLM4Vis}{https://github.com/demoleiwang/LLM4Vis}.\n","authors":["Lei Wang","Songheng Zhang","Yun Wang","Ee-Peng Lim","Yong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07652v1.pdf","comment":"EMNLP 2023 (Industry Track)"},{"id":"http://arxiv.org/abs/2307.15770v2","updated":"2023-10-11T16:49:29Z","published":"2023-07-28T18:58:16Z","title":"CHATREPORT: Democratizing Sustainability Disclosure Analysis through\n LLM-based Tools","summary":" In the face of climate change, are companies really taking substantial steps\ntoward more sustainable operations? A comprehensive answer lies in the dense,\ninformation-rich landscape of corporate sustainability reports. However, the\nsheer volume and complexity of these reports make human analysis very costly.\nTherefore, only a few entities worldwide have the resources to analyze these\nreports at scale, which leads to a lack of transparency in sustainability\nreporting. Empowering stakeholders with LLM-based automatic analysis tools can\nbe a promising way to democratize sustainability report analysis. However,\ndeveloping such tools is challenging due to (1) the hallucination of LLMs and\n(2) the inefficiency of bringing domain experts into the AI development loop.\nIn this paper, we introduce ChatReport, a novel LLM-based system to automate the analysis\nof corporate sustainability reports, addressing existing challenges by (1)\nmaking the answers traceable to reduce the harm of hallucination and (2)\nactively involving domain experts in the development loop. 
We make our\nmethodology, annotated datasets, and generated analyses of 1015 reports\npublicly available.\n","authors":["Jingwei Ni","Julia Bingler","Chiara Colesanti-Senni","Mathias Kraus","Glen Gostlow","Tobias Schimanski","Dominik Stammbach","Saeid Ashraf Vaghefi","Qian Wang","Nicolas Webersinke","Tobias Wekhof","Tingyu Yu","Markus Leippold"],"pdf_url":"https://arxiv.org/pdf/2307.15770v2.pdf","comment":"6 pages. arXiv admin note: text overlap with arXiv:2306.15518"},{"id":"http://arxiv.org/abs/2310.07644v1","updated":"2023-10-11T16:40:57Z","published":"2023-10-11T16:40:57Z","title":"Rethinking the BERT-like Pretraining for DNA Sequences","summary":" With the success of large-scale pretraining in NLP, there is an increasing\ntrend of applying it to the domain of life sciences. In particular, pretraining\nmethods based on DNA sequences have garnered growing attention due to their\npotential to capture generic information about genes. However, existing\npretraining methods for DNA sequences largely rely on direct adoptions of BERT\npretraining from NLP, lacking a comprehensive understanding and a specifically\ntailored approach. To address this research gap, we first conducted a series of\nexploratory experiments and gained several insightful observations: 1) In the\nfine-tuning phase of downstream tasks, when using K-mer overlapping\ntokenization instead of K-mer non-overlapping tokenization, both overlapping\nand non-overlapping pretraining weights show consistent performance\nimprovement. 2) During the pre-training process, using K-mer overlapping\ntokenization quickly produces clear K-mer embeddings and reduces the loss to a\nvery low level, while using K-mer non-overlapping tokenization results in less\ndistinct embeddings and continuously decreases the loss. 3) Using overlapping\ntokenization causes the self-attention in the intermediate layers of\npre-trained models to tend to overly focus on certain tokens, reflecting that\nthese layers are not adequately optimized. In summary, overlapping tokenization\ncan benefit the fine-tuning of downstream tasks but leads to inadequate\npretraining with fast convergence. To unleash the pretraining potential, we\nintroduce a novel approach called RandomMask, which gradually increases the\ntask difficulty of BERT-like pretraining by continuously expanding its mask\nboundary, forcing the model to learn more knowledge. RandomMask is simple but\neffective, achieving top-tier performance on 26 of 28 datasets\nspanning 7 downstream tasks.\n","authors":["Chaoqi Liang","Weiqiang Bai","Lifeng Qiao","Yuchen Ren","Jianle Sun","Peng Ye","Hongliang Yan","Xinzhu Ma","Wangmeng Zuo","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.07644v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07641v1","updated":"2023-10-11T16:38:11Z","published":"2023-10-11T16:38:11Z","title":"Evaluating Large Language Models at Evaluating Instruction Following","summary":" As research in large language models (LLMs) continues to accelerate,\nLLM-based evaluation has emerged as a scalable and cost-effective alternative\nto human evaluations for comparing the ever increasing list of models. This\npaper investigates the efficacy of these \"LLM evaluators\", particularly in\nusing them to assess instruction following, a metric that gauges how closely\ngenerated text adheres to the given instruction. We introduce a challenging\nmeta-evaluation benchmark, LLMBar, designed to test the ability of an LLM\nevaluator in discerning instruction-following outputs. 
The authors manually\ncurated 419 pairs of outputs, one adhering to instructions while the other\ndiverging, yet may possess deceptive qualities that mislead an LLM evaluator,\ne.g., a more engaging tone. Contrary to existing meta-evaluation, we discover\nthat different evaluators (i.e., combinations of LLMs and prompts) exhibit\ndistinct performance on LLMBar and even the highest-scoring ones have\nsubstantial room for improvement. We also present a novel suite of prompting\nstrategies that further close the gap between LLM and human evaluators. With\nLLMBar, we hope to offer more insight into LLM evaluators and foster future\nresearch in developing better instruction-following models.\n","authors":["Zhiyuan Zeng","Jiatong Yu","Tianyu Gao","Yu Meng","Tanya Goyal","Danqi Chen"],"pdf_url":"https://arxiv.org/pdf/2310.07641v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2306.13047v3","updated":"2023-10-11T16:33:39Z","published":"2023-06-22T17:13:08Z","title":"Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with\n a Focus on Candidate Response Distribution","summary":" Multiple choice exams are widely used to assess candidates across a diverse\nrange of domains and tasks. To moderate question quality, newly proposed\nquestions often pass through pre-test evaluation stages before being deployed\ninto real-world exams. Currently, this evaluation process is manually\nintensive, which can lead to time lags in the question development cycle.\nStreamlining this process via automation can significantly enhance efficiency,\nhowever, there's a current lack of datasets with adequate pre-test analysis\ninformation. In this paper we analyse the Cambridge Multiple-Choice Questions\nReading Dataset; a multiple-choice comprehension dataset of questions at\ndifferent target levels, with corresponding candidate selection distributions.\nWe introduce the task of candidate distribution matching, propose several\nevaluation metrics for the task, and demonstrate that automatic systems trained\non RACE++ can be leveraged as baselines for our task. We further demonstrate\nthat these automatic systems can be used for practical pre-test evaluation\ntasks such as detecting underperforming distractors, where our detection\nsystems can automatically identify poor distractors that few candidates select.\n","authors":["Adian Liusie","Vatsal Raina","Andrew Mullooly","Kate Knill","Mark J. F. Gales"],"pdf_url":"https://arxiv.org/pdf/2306.13047v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07629v1","updated":"2023-10-11T16:18:13Z","published":"2023-10-11T16:18:13Z","title":"The Past, Present and Better Future of Feedback Learning in Large\n Language Models for Subjective Human Preferences and Values","summary":" Human feedback is increasingly used to steer the behaviours of Large Language\nModels (LLMs). However, it is unclear how to collect and incorporate feedback\nin a way that is efficient, effective and unbiased, especially for highly\nsubjective human preferences and values. In this paper, we survey existing\napproaches for learning from human feedback, drawing on 95 papers primarily\nfrom the ACL and arXiv repositories.First, we summarise the past, pre-LLM\ntrends for integrating human feedback into language models. Second, we give an\noverview of present techniques and practices, as well as the motivations for\nusing feedback; conceptual frameworks for defining values and preferences; and\nhow feedback is collected and from whom. 
Finally, we encourage a better future\nof feedback learning in LLMs by raising five unresolved conceptual and\npractical challenges.\n","authors":["Hannah Rose Kirk","Andrew M. Bean","Bertie Vidgen","Paul Röttger","Scott A. Hale"],"pdf_url":"https://arxiv.org/pdf/2310.07629v1.pdf","comment":"Accepted for the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP, Main)"},{"id":"http://arxiv.org/abs/2303.08268v3","updated":"2023-10-11T16:17:20Z","published":"2023-03-14T23:01:27Z","title":"Chat with the Environment: Interactive Multimodal Perception Using Large\n Language Models","summary":" Programming robot behavior in a complex world faces challenges on multiple\nlevels, from dextrous low-level skills to high-level planning and reasoning.\nRecent pre-trained Large Language Models (LLMs) have shown remarkable reasoning\nability in few-shot robotic planning. However, it remains challenging to ground\nLLMs in multimodal sensory input and continuous action output, while enabling a\nrobot to interact with its environment and acquire novel information as its\npolicies unfold. We develop a robot interaction scenario with a partially\nobservable state, which necessitates a robot to decide on a range of epistemic\nactions in order to sample sensory information among multiple modalities,\nbefore being able to execute the task correctly. Matcha (Multimodal environment\nchatting) agent, an interactive perception framework, is therefore proposed\nwith an LLM as its backbone, whose ability is exploited to instruct epistemic\nactions and to reason over the resulting multimodal sensations (vision, sound,\nhaptics, proprioception), as well as to plan an entire task execution based on\nthe interactively acquired information. Our study demonstrates that LLMs can\nprovide high-level planning and reasoning skills and control interactive robot\nbehavior in a multimodal environment, while multimodal modules with the context\nof the environmental state help ground the LLMs and extend their processing\nability. The project website can be found at https://matcha-agent.github.io.\n","authors":["Xufeng Zhao","Mengdi Li","Cornelius Weber","Muhammad Burhan Hafez","Stefan Wermter"],"pdf_url":"https://arxiv.org/pdf/2303.08268v3.pdf","comment":"IROS2023, Detroit. See the project website at\n https://matcha-agent.github.io"},{"id":"http://arxiv.org/abs/2310.07611v1","updated":"2023-10-11T15:56:00Z","published":"2023-10-11T15:56:00Z","title":"Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in\n Self-Refined Open-Source Models","summary":" The dominance of proprietary LLMs has led to restricted access and raised\ninformation privacy concerns. High-performing open-source alternatives are\ncrucial for information-sensitive and high-volume applications but often lag\nbehind in performance. To address this gap, we propose (1) A untargeted variant\nof iterative self-critique and self-refinement devoid of external influence.\n(2) A novel ranking metric - Performance, Refinement, and Inference Cost Score\n(PeRFICS) - to find the optimal model for a given task considering refined\nperformance and cost. Our experiments show that SoTA open source models of\nvarying sizes from 7B - 65B, on average, improve 8.2% from their baseline\nperformance. 
Strikingly, even models with extremely small memory footprints,\nsuch as Vicuna-7B, show a 11.74% improvement overall and up to a 25.39%\nimprovement in high-creativity, open ended tasks on the Vicuna benchmark.\nVicuna-13B takes it a step further and outperforms ChatGPT post-refinement.\nThis work has profound implications for resource-constrained and\ninformation-sensitive environments seeking to leverage LLMs without incurring\nprohibitive costs, compromising on performance and privacy. The domain-agnostic\nself-refinement process coupled with our novel ranking metric facilitates\ninformed decision-making in model selection, thereby reducing costs and\ndemocratizing access to high-performing language models, as evidenced by case\nstudies.\n","authors":["Sumuk Shashidhar","Abhinav Chinta","Vaibhav Sahai","Zhenhailong Wang","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2310.07611v1.pdf","comment":"Initial Preprint. Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07609v1","updated":"2023-10-11T15:51:53Z","published":"2023-10-11T15:51:53Z","title":"QACHECK: A Demonstration System for Question-Guided Multi-Hop\n Fact-Checking","summary":" Fact-checking real-world claims often requires complex, multi-step reasoning\ndue to the absence of direct evidence to support or refute them. However,\nexisting fact-checking systems often lack transparency in their\ndecision-making, making it challenging for users to comprehend their reasoning\nprocess. To address this, we propose the Question-guided Multi-hop\nFact-Checking (QACHECK) system, which guides the model's reasoning process by\nasking a series of questions critical for verifying a claim. QACHECK has five\nkey modules: a claim verifier, a question generator, a question-answering\nmodule, a QA validator, and a reasoner. Users can input a claim into QACHECK,\nwhich then predicts its veracity and provides a comprehensive report detailing\nits reasoning process, guided by a sequence of (question, answer) pairs.\nQACHECK also provides the source of evidence supporting each question,\nfostering a transparent, explainable, and user-friendly fact-checking process.\nA recorded video of QACHECK is at https://www.youtube.com/watch?v=ju8kxSldM64\n","authors":["Liangming Pan","Xinyuan Lu","Min-Yen Kan","Preslav Nakov"],"pdf_url":"https://arxiv.org/pdf/2310.07609v1.pdf","comment":"Accepted at EMNLP 2023 System Demonstrations Track"},{"id":"http://arxiv.org/abs/2310.07588v1","updated":"2023-10-11T15:28:44Z","published":"2023-10-11T15:28:44Z","title":"Accurate Use of Label Dependency in Multi-Label Text Classification\n Through the Lens of Causality","summary":" Multi-Label Text Classification (MLTC) aims to assign the most relevant\nlabels to each given text. Existing methods demonstrate that label dependency\ncan help to improve the model's performance. However, the introduction of label\ndependency may cause the model to suffer from unwanted prediction bias. In this\nstudy, we attribute the bias to the model's misuse of label dependency, i.e.,\nthe model tends to utilize the correlation shortcut in label dependency rather\nthan fusing text information and label dependency for prediction. 
Motivated by\ncausal inference, we propose a CounterFactual Text Classifier (CFTC) to\neliminate the correlation bias, and make causality-based predictions.\nSpecifically, our CFTC first adopts the predict-then-modify backbone to extract\nprecise label information embedded in label dependency, then blocks the\ncorrelation shortcut through the counterfactual de-bias technique with the help\nof the human causal graph. Experimental results on three datasets demonstrate\nthat our CFTC significantly outperforms the baselines and effectively\neliminates the correlation bias in datasets.\n","authors":["Caoyun Fan","Wenqing Chen","Jidong Tian","Yitian Li","Hao He","Yaohui Jin"],"pdf_url":"https://arxiv.org/pdf/2310.07588v1.pdf","comment":"Applied Intelligence 2023"},{"id":"http://arxiv.org/abs/2304.14933v2","updated":"2023-10-11T15:08:51Z","published":"2023-04-28T15:43:21Z","title":"An Empirical Study of Multimodal Model Merging","summary":" Model merging (e.g., via interpolation or task arithmetic) fuses multiple\nmodels trained on different tasks to generate a multi-task solution. The\ntechnique has been proven successful in previous studies, where the models are\ntrained on similar tasks and with the same initialization. In this paper, we\nexpand on this concept to a multimodal setup by merging transformers trained on\ndifferent modalities. Furthermore, we conduct our study for a novel goal where\nwe can merge vision, language, and cross-modal transformers of a\nmodality-specific architecture to create a parameter-efficient\nmodality-agnostic architecture. Through comprehensive experiments, we\nsystematically investigate the key factors impacting model performance after\nmerging, including initialization, merging mechanisms, and model architectures.\nWe also propose two metrics that assess the distance between weights to be\nmerged and can serve as an indicator of the merging outcomes. Our analysis\nleads to an effective training recipe for matching the performance of the\nmodality-agnostic baseline (i.e., pre-trained from scratch) via model merging.\nOur method also outperforms naive merging significantly on various tasks, with\nimprovements of 3% on VQA, 7% on COCO retrieval, 25% on NLVR2, 14% on Flickr30k\nand 3% on ADE20k. Our code is available at https://github.com/ylsung/vl-merging\n","authors":["Yi-Lin Sung","Linjie Li","Kevin Lin","Zhe Gan","Mohit Bansal","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2304.14933v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2308.12067v2","updated":"2023-10-11T14:49:26Z","published":"2023-08-23T11:27:30Z","title":"InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4","summary":" Multimodal large language models are typically trained in two stages: first\npre-training on image-text pairs, and then fine-tuning using supervised\nvision-language instruction data. Recent studies have shown that large language\nmodels can achieve satisfactory results even with a limited amount of\nhigh-quality instruction-following data. In this paper, we introduce\nInstructionGPT-4, which is fine-tuned on a small dataset comprising only 200\nexamples, amounting to approximately 6\\% of the instruction-following data used\nin the alignment dataset for MiniGPT-4. To achieve this, we first propose\nseveral metrics to access the quality of multimodal instruction data. Based on\nthese metrics, we present an effective and trainable data selector to\nautomatically identify and filter low-quality vision-language data. 
By\nemploying this method, InstructionGPT-4 outperforms the original MiniGPT-4 on\nvarious evaluations. Overall, our findings demonstrate that less but\nhigh-quality instruction tuning data is efficient in enabling multimodal large\nlanguage models to generate better output. Our code is available at\nhttps://github.com/waltonfuture/InstructionGPT-4.\n","authors":["Lai Wei","Zihao Jiang","Weiran Huang","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2308.12067v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07521v1","updated":"2023-10-11T14:18:03Z","published":"2023-10-11T14:18:03Z","title":"Survey on Factuality in Large Language Models: Knowledge, Retrieval and\n Domain-Specificity","summary":" This survey addresses the crucial issue of factuality in Large Language\nModels (LLMs). As LLMs find applications across diverse domains, the\nreliability and accuracy of their outputs become vital. We define the\nFactuality Issue as the probability of LLMs to produce content inconsistent\nwith established facts. We first delve into the implications of these\ninaccuracies, highlighting the potential consequences and challenges posed by\nfactual errors in LLM outputs. Subsequently, we analyze the mechanisms through\nwhich LLMs store and process facts, seeking the primary causes of factual\nerrors. Our discussion then transitions to methodologies for evaluating LLM\nfactuality, emphasizing key metrics, benchmarks, and studies. We further\nexplore strategies for enhancing LLM factuality, including approaches tailored\nfor specific domains. We focus on two primary LLM configurations, standalone LLMs\nand Retrieval-Augmented LLMs that utilize external data, and detail their\nunique challenges and potential enhancements. Our survey offers a structured\nguide for researchers aiming to fortify the factual reliability of LLMs.\n","authors":["Cunxiang Wang","Xiaoze Liu","Yuanhao Yue","Xiangru Tang","Tianhang Zhang","Cheng Jiayang","Yunzhi Yao","Wenyang Gao","Xuming Hu","Zehan Qi","Yidong Wang","Linyi Yang","Jindong Wang","Xing Xie","Zheng Zhang","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07521v1.pdf","comment":"43 pages; 300+ references"},{"id":"http://arxiv.org/abs/2310.07488v1","updated":"2023-10-11T13:35:05Z","published":"2023-10-11T13:35:05Z","title":"KwaiYiiMath: Technical Report","summary":" Recent advancements in large language models (LLMs) have demonstrated\nremarkable abilities in handling a variety of natural language processing (NLP)\ndownstream tasks, even on mathematical tasks requiring multi-step reasoning. In\nthis report, we introduce KwaiYiiMath, which enhances the mathematical\nreasoning abilities of KwaiYiiBase1, by applying Supervised Fine-Tuning (SFT)\nand Reinforcement Learning from Human Feedback (RLHF), including on both English\nand Chinese mathematical tasks. Meanwhile, we also constructed a small-scale\nChinese primary school mathematics test set (named KMath), consisting of 188\nexamples to evaluate the correctness of the problem-solving process generated\nby the models. 
Empirical studies demonstrate that KwaiYiiMath can achieve\nstate-of-the-art (SOTA) performance on GSM8k, CMath, and KMath compared with\nthe similar size models, respectively.\n","authors":["Jiayi Fu","Lei Lin","Xiaoyang Gao","Pengli Liu","Zhengzong Chen","Zhirui Yang","Shengnan Zhang","Xue Zheng","Yan Li","Yuliang Liu","Xucheng Ye","Yiqiao Liao","Chao Liao","Bin Chen","Chengru Song","Junchen Wan","Zijia Lin","Fuzheng Zhang","Zhongyuan Wang","Di Zhang","Kun Gai"],"pdf_url":"https://arxiv.org/pdf/2310.07488v1.pdf","comment":"technical report"},{"id":"http://arxiv.org/abs/2310.07487v1","updated":"2023-10-11T13:34:22Z","published":"2023-10-11T13:34:22Z","title":"Cognate Transformer for Automated Phonological Reconstruction and\n Cognate Reflex Prediction","summary":" Phonological reconstruction is one of the central problems in historical\nlinguistics where a proto-word of an ancestral language is determined from the\nobserved cognate words of daughter languages. Computational approaches to\nhistorical linguistics attempt to automate the task by learning models on\navailable linguistic data. Several ideas and techniques drawn from\ncomputational biology have been successfully applied in the area of\ncomputational historical linguistics. Following these lines, we adapt MSA\nTransformer, a protein language model, to the problem of automated phonological\nreconstruction. MSA Transformer trains on multiple sequence alignments as input\nand is, thus, apt for application on aligned cognate words. We, hence, name our\nmodel as Cognate Transformer. We also apply the model on another associated\ntask, namely, cognate reflex prediction, where a reflex word in a daughter\nlanguage is predicted based on cognate words from other daughter languages. We\nshow that our model outperforms the existing models on both tasks, especially\nwhen it is pre-trained on masked word prediction task.\n","authors":["V. S. D. S. Mahesh Akavarapu","Arnab Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2310.07487v1.pdf","comment":"Accepted to appear at the conference of EMNLP-2023"},{"id":"http://arxiv.org/abs/2310.07423v1","updated":"2023-10-11T12:15:24Z","published":"2023-10-11T12:15:24Z","title":"Adapting the adapters for code-switching in multilingual ASR","summary":" Recently, large pre-trained multilingual speech models have shown potential\nin scaling Automatic Speech Recognition (ASR) to many low-resource languages.\nSome of these models employ language adapters in their formulation, which helps\nto improve monolingual performance and avoids some of the drawbacks of\nmulti-lingual modeling on resource-rich languages. However, this formulation\nrestricts the usability of these models on code-switched speech, where two\nlanguages are mixed together in the same utterance. In this work, we propose\nways to effectively fine-tune such models on code-switched speech, by\nassimilating information from both language adapters at each language\nadaptation point in the network. We also model code-switching as a sequence of\nlatent binary sequences that can be used to guide the flow of information from\neach language adapter at the frame level. 
The proposed approaches are evaluated\non three code-switched datasets encompassing Arabic, Mandarin, and Hindi\nlanguages paired with English, showing consistent improvements in\ncode-switching performance with at least 10\\% absolute reduction in CER across\nall test sets.\n","authors":["Atharva Kulkarni","Ajinkya Kulkarni","Miguel Couceiro","Hanan Aldarmaki"],"pdf_url":"https://arxiv.org/pdf/2310.07423v1.pdf","comment":"Submitted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2310.07403v1","updated":"2023-10-11T11:39:36Z","published":"2023-10-11T11:39:36Z","title":"DASpeech: Directed Acyclic Transformer for Fast and High-quality\n Speech-to-Speech Translation","summary":" Direct speech-to-speech translation (S2ST) translates speech from one\nlanguage into another using a single model. However, due to the presence of\nlinguistic and acoustic diversity, the target speech follows a complex\nmultimodal distribution, posing challenges to achieving both high-quality\ntranslations and fast decoding speeds for S2ST models. In this paper, we\npropose DASpeech, a non-autoregressive direct S2ST model which realizes both\nfast and high-quality S2ST. To better capture the complex distribution of the\ntarget speech, DASpeech adopts the two-pass architecture to decompose the\ngeneration process into two steps, where a linguistic decoder first generates\nthe target text, and an acoustic decoder then generates the target speech based\non the hidden states of the linguistic decoder. Specifically, we use the\ndecoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as\nthe acoustic decoder. DA-Transformer models translations with a directed\nacyclic graph (DAG). To consider all potential paths in the DAG during\ntraining, we calculate the expected hidden states for each target token via\ndynamic programming, and feed them into the acoustic decoder to predict the\ntarget mel-spectrogram. During inference, we select the most probable path and\ntake hidden states on that path as input to the acoustic decoder. Experiments\non the CVSS Fr-En benchmark demonstrate that DASpeech can achieve comparable or\neven better performance than the state-of-the-art S2ST model Translatotron 2,\nwhile preserving up to 18.53x speedup compared to the autoregressive baseline.\nCompared with the previous non-autoregressive S2ST model, DASpeech does not\nrely on knowledge distillation and iterative decoding, achieving significant\nimprovements in both translation quality and decoding speed. Furthermore,\nDASpeech shows the ability to preserve the speaker's voice of the source speech\nduring translation.\n","authors":["Qingkai Fang","Yan Zhou","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.07403v1.pdf","comment":"NeurIPS 2023. Audio samples are available at\n https://ictnlp.github.io/daspeech-demo/"},{"id":"http://arxiv.org/abs/2310.07397v1","updated":"2023-10-11T11:32:57Z","published":"2023-10-11T11:32:57Z","title":"Target-oriented Proactive Dialogue Systems with Personalization: Problem\n Formulation and Dataset Curation","summary":" Target-oriented dialogue systems, designed to proactively steer conversations\ntoward predefined targets or accomplish specific system-side goals, are an\nexciting area in conversational AI. In this work, by formulating a pair as the conversation target, we explore a novel problem of\npersonalized target-oriented dialogue by considering personalization during the\ntarget accomplishment process. 
However, there remains an emergent need for\nhigh-quality datasets, and building one from scratch requires tremendous human\neffort. To address this, we propose an automatic dataset curation framework\nusing a role-playing approach. Based on this framework, we construct a\nlarge-scale personalized target-oriented dialogue dataset, TopDial, which\ncomprises about 18K multi-turn dialogues. The experimental results show that\nthis dataset is of high quality and could contribute to exploring personalized\ntarget-oriented dialogue.\n","authors":["Jian Wang","Yi Cheng","Dongding Lin","Chak Tou Leong","Wenjie Li"],"pdf_url":"https://arxiv.org/pdf/2310.07397v1.pdf","comment":"Accepted to EMNLP-2023 main conference"},{"id":"http://arxiv.org/abs/2310.07387v1","updated":"2023-10-11T11:08:20Z","published":"2023-10-11T11:08:20Z","title":"Linguistic laws in biology","summary":" Linguistic laws, the common statistical patterns of human language, have been\ninvestigated by quantitative linguists for nearly a century. Recently,\nbiologists from a range of disciplines have started to explore the prevalence\nof these laws beyond language, finding patterns consistent with linguistic laws\nacross multiple levels of biological organisation, from molecular (genomes,\ngenes, and proteins) to organismal (animal behaviour) to ecological\n(populations and ecosystems). We propose a new conceptual framework for the\nstudy of linguistic laws in biology, comprising and integrating distinct levels\nof analysis, from description to prediction to theory building. Adopting this\nframework will provide critical new insights into the fundamental rules of\norganisation underpinning natural systems, unifying linguistic laws and core\ntheory in biology.\n","authors":["Stuart Semple","Ramon Ferrer-i-Cancho","Morgan L. Gustison"],"pdf_url":"https://arxiv.org/pdf/2310.07387v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.08732v3","updated":"2023-10-11T10:51:12Z","published":"2023-05-15T15:47:09Z","title":"Knowledge Rumination for Pre-trained Language Models","summary":" Previous studies have revealed that vanilla pre-trained language models\n(PLMs) lack the capacity to handle knowledge-intensive NLP tasks alone; thus,\nseveral works have attempted to integrate external knowledge into PLMs.\nHowever, despite the promising outcome, we empirically observe that PLMs may\nhave already encoded rich knowledge in their pre-trained parameters but fail to\nfully utilize them when applying them to knowledge-intensive tasks. In this\npaper, we propose a new paradigm dubbed Knowledge Rumination to help the\npre-trained language model utilize that related latent knowledge without\nretrieving it from the external corpus. By simply adding a prompt like \"As far\nas I know\" to the PLMs, we try to review related latent knowledge and inject\nthem back into the model for knowledge consolidation. We apply the proposed\nknowledge rumination to various language models, including RoBERTa, DeBERTa,\nand GPT-3. Experimental results on six commonsense reasoning tasks and GLUE\nbenchmarks demonstrate the effectiveness of our proposed approach, which proves\nthat the knowledge stored in PLMs can be better exploited to enhance\nperformance. 
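The Knowledge Rumination abstract above describes using a prompt such as "As far as I know" to review latent knowledge and then consolidate it for the task. A minimal sketch of that two-stage flow follows, assuming a hypothetical `call_model` interface; the actual method injects the reviewed knowledge back into the model rather than relying on plain text prompting alone.

```python
# Illustrative two-stage flow for Knowledge Rumination prompting.
# `call_model` is a hypothetical placeholder, not a real API.

def call_model(prompt: str) -> str:
    """Stand-in for a language-model call (assumption for illustration)."""
    return "(model output for: " + prompt[:40] + "...)"

def ruminate_then_answer(question: str) -> str:
    # Stage 1: ask the model to "review" its own latent knowledge.
    knowledge = call_model(f"As far as I know, {question}")
    # Stage 2: answer the original question with the elicited knowledge injected.
    return call_model(f"Background: {knowledge}\nQuestion: {question}\nAnswer:")

if __name__ == "__main__":
    print(ruminate_then_answer("why do people wear sunglasses in bright sunlight?"))
```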
Code is available in\nhttps://github.com/zjunlp/knowledge-rumination.\n","authors":["Yunzhi Yao","Peng Wang","Shengyu Mao","Chuanqi Tan","Fei Huang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.08732v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.06692v2","updated":"2023-10-11T10:05:29Z","published":"2023-10-10T15:10:03Z","title":"Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task\n Scenarios with Large Language Models","summary":" Large language models (LLMs) have unveiled remarkable reasoning capabilities\nby exploiting chain-of-thought (CoT) prompting, which generates intermediate\nreasoning chains to serve as the rationale for deriving the answer. However,\ncurrent CoT methods either simply employ general prompts such as Let's think\nstep by step, or heavily rely on handcrafted task-specific demonstrations to\nattain preferable performances, thereby engendering an inescapable gap between\nperformance and generalization. To bridge this gap, we propose Meta-CoT, a\ngeneralizable CoT prompting method in mixed-task scenarios where the type of\ninput questions is unknown. Meta-CoT firstly categorizes the scenario based on\nthe input question and subsequently constructs diverse demonstrations from the\ncorresponding data pool in an automatic pattern. Meta-CoT simultaneously enjoys\nremarkable performances on ten public benchmark reasoning tasks and superior\ngeneralization capabilities. Notably, Meta-CoT achieves the state-of-the-art\nresult on SVAMP (93.7%) without any additional program-aided methods. Our\nfurther experiments on five out-of-distribution datasets verify the stability\nand generality of Meta-CoT.\n","authors":["Anni Zou","Zhuosheng Zhang","Hai Zhao","Xiangru Tang"],"pdf_url":"https://arxiv.org/pdf/2310.06692v2.pdf","comment":"17 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.07347v1","updated":"2023-10-11T09:55:46Z","published":"2023-10-11T09:55:46Z","title":"Fast-ELECTRA for Efficient Pre-training","summary":" ELECTRA pre-trains language models by detecting tokens in a sequence that\nhave been replaced by an auxiliary model. Although ELECTRA offers a significant\nboost in efficiency, its potential is constrained by the training cost brought\nby the auxiliary model. Notably, this model, which is jointly trained with the\nmain model, only serves to assist the training of the main model and is\ndiscarded post-training. This results in a substantial amount of training cost\nbeing expended in vain. To mitigate this issue, we propose Fast-ELECTRA, which\nleverages an existing language model as the auxiliary model. To construct a\nlearning curriculum for the main model, we smooth its output distribution via\ntemperature scaling following a descending schedule. Our approach rivals the\nperformance of state-of-the-art ELECTRA-style pre-training methods, while\nsignificantly eliminating the computation and memory cost brought by the joint\ntraining of the auxiliary model. 
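Fast-ELECTRA (above) smooths the frozen auxiliary model's output distribution via temperature scaling that follows a descending schedule. The sketch below shows a temperature-scaled softmax with an assumed linear decay; the start and end temperatures and the schedule shape are illustrative assumptions, not values from the paper.

```python
# Sketch: temperature-scaled softmax plus a descending temperature schedule,
# as used conceptually in Fast-ELECTRA to build a curriculum for the main model.

import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax; a higher temperature gives a smoother distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def descending_temperature(step: int, total_steps: int,
                           t_start: float = 2.0, t_end: float = 1.0) -> float:
    """Assumed linear decay from t_start to t_end over training (illustrative only)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start + (t_end - t_start) * frac

if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    for step in (0, 5000, 10000):
        t = descending_temperature(step, total_steps=10000)
        print(step, round(t, 2), softmax_with_temperature(logits, t))
```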
Our method also reduces the sensitivity to\nhyper-parameters and enhances the pre-training stability.\n","authors":["Chengyu Dong","Liyuan Liu","Hao Cheng","Jingbo Shang","Jianfeng Gao","Xiaodong Liu"],"pdf_url":"https://arxiv.org/pdf/2310.07347v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07345v1","updated":"2023-10-11T09:53:17Z","published":"2023-10-11T09:53:17Z","title":"Investigating the Effect of Language Models in Sequence Discriminative\n Training for Neural Transducers","summary":" In this work, we investigate the effect of language models (LMs) with\ndifferent context lengths and label units (phoneme vs. word) used in sequence\ndiscriminative training for phoneme-based neural transducers. Both lattice-free\nand N-best-list approaches are examined. For lattice-free methods with\nphoneme-level LMs, we propose a method to approximate the context history to\nemploy LMs with full-context dependency. This approximation can be extended to\narbitrary context length and enables the usage of word-level LMs in\nlattice-free methods. Moreover, a systematic comparison is conducted across\nlattice-free and N-best-list-based methods. Experimental results on Librispeech\nshow that using the word-level LM in training outperforms the phoneme-level LM.\nBesides, we find that the context size of the LM used for probability\ncomputation has a limited effect on performance. Moreover, our results reveal\nthe pivotal importance of the hypothesis space quality in sequence\ndiscriminative training.\n","authors":["Zijian Yang","Wei Zhou","Ralf Schlüter","Hermann Ney"],"pdf_url":"https://arxiv.org/pdf/2310.07345v1.pdf","comment":"accepted at ASRU 2023"},{"id":"http://arxiv.org/abs/2304.14767v2","updated":"2023-10-11T09:49:59Z","published":"2023-04-28T11:26:17Z","title":"Dissecting Recall of Factual Associations in Auto-Regressive Language\n Models","summary":" Transformer-based language models (LMs) are known to capture factual\nknowledge in their parameters. While previous work looked into where factual\nassociations are stored, only little is known about how they are retrieved\ninternally during inference. We investigate this question through the lens of\ninformation flow. Given a subject-relation query, we study how the model\naggregates information about the subject and relation to predict the correct\nattribute. With interventions on attention edges, we first identify two\ncritical points where information propagates to the prediction: one from the\nrelation positions followed by another from the subject positions. Next, by\nanalyzing the information at these points, we unveil a three-step internal\nmechanism for attribute extraction. First, the representation at the\nlast-subject position goes through an enrichment process, driven by the early\nMLP sublayers, to encode many subject-related attributes. Second, information\nfrom the relation propagates to the prediction. Third, the prediction\nrepresentation \"queries\" the enriched subject to extract the attribute. Perhaps\nsurprisingly, this extraction is typically done via attention heads, which\noften encode subject-attribute mappings in their parameters. 
Overall, our\nfindings introduce a comprehensive view of how factual associations are stored\nand extracted internally in LMs, facilitating future research on knowledge\nlocalization and editing.\n","authors":["Mor Geva","Jasmijn Bastings","Katja Filippova","Amir Globerson"],"pdf_url":"https://arxiv.org/pdf/2304.14767v2.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07343v1","updated":"2023-10-11T09:46:32Z","published":"2023-10-11T09:46:32Z","title":"How Do Large Language Models Capture the Ever-changing World Knowledge?\n A Review of Recent Advances","summary":" Although large language models (LLMs) are impressive in solving various\ntasks, they can quickly be outdated after deployment. Maintaining their\nup-to-date status is a pressing concern in the current era. This paper provides\na comprehensive review of recent advances in aligning LLMs with the\never-changing world knowledge without re-training from scratch. We categorize\nresearch works systemically and provide in-depth comparisons and discussion. We\nalso discuss existing challenges and highlight future directions to facilitate\nresearch in this field. We release the paper list at\nhttps://github.com/hyintell/awesome-refreshing-llms\n","authors":["Zihan Zhang","Meng Fang","Ling Chen","Mohammad-Reza Namazi-Rad","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07343v1.pdf","comment":"EMNLP 2023 main conference, paper link at\n https://github.com/hyintell/awesome-refreshing-llms"},{"id":"http://arxiv.org/abs/2304.11082v4","updated":"2023-10-11T09:45:15Z","published":"2023-04-19T17:50:09Z","title":"Fundamental Limitations of Alignment in Large Language Models","summary":" An important aspect in developing language models that interact with humans\nis aligning their behavior to be useful and unharmful for their human users.\nThis is usually achieved by tuning the model in a way that enhances desired\nbehaviors and inhibits undesired ones, a process referred to as alignment. In\nthis paper, we propose a theoretical approach called Behavior Expectation\nBounds (BEB) which allows us to formally investigate several inherent\ncharacteristics and limitations of alignment in large language models.\nImportantly, we prove that within the limits of this framework, for any\nbehavior that has a finite probability of being exhibited by the model, there\nexist prompts that can trigger the model into outputting this behavior, with\nprobability that increases with the length of the prompt. This implies that any\nalignment process that attenuates an undesired behavior but does not remove it\naltogether, is not safe against adversarial prompting attacks. Furthermore, our\nframework hints at the mechanism by which leading alignment approaches such as\nreinforcement learning from human feedback make the LLM prone to being prompted\ninto the undesired behaviors. This theoretical result is being experimentally\ndemonstrated in large scale by the so called contemporary \"chatGPT jailbreaks\",\nwhere adversarial users trick the LLM into breaking its alignment guardrails by\ntriggering it into acting as a malicious persona. 
Our results expose\nfundamental limitations in alignment of LLMs and bring to the forefront the\nneed to devise reliable mechanisms for ensuring AI safety.\n","authors":["Yotam Wolf","Noam Wies","Oshri Avnery","Yoav Levine","Amnon Shashua"],"pdf_url":"https://arxiv.org/pdf/2304.11082v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07328v1","updated":"2023-10-11T09:18:09Z","published":"2023-10-11T09:18:09Z","title":"An Empirical Study of Instruction-tuning Large Language Models in\n Chinese","summary":" The success of ChatGPT validates the potential of large language models\n(LLMs) in artificial general intelligence (AGI). Subsequently, the release of\nLLMs has sparked the open-source community's interest in instruction-tuning,\nwhich is deemed to accelerate ChatGPT's replication process. However, research\non instruction-tuning LLMs in Chinese, the world's most spoken language, is\nstill in its early stages. Therefore, this paper makes an in-depth empirical\nstudy of instruction-tuning LLMs in Chinese, which can serve as a cookbook that\nprovides valuable findings for effectively customizing LLMs that can better\nrespond to Chinese instructions. Specifically, we systematically explore the\nimpact of LLM bases, parameter-efficient methods, instruction data types, which\nare the three most important elements for instruction-tuning. Besides, we also\nconduct experiment to study the impact of other factors, e.g., chain-of-thought\ndata and human-value alignment. We hope that this empirical study can make a\nmodest contribution to the open Chinese version of ChatGPT. This paper will\nrelease a powerful Chinese LLMs that is comparable to ChatGLM. The code and\ndata are available at https://github.com/PhoebusSi/Alpaca-CoT.\n","authors":["Qingyi Si","Tong Wang","Zheng Lin","Xu Zhang","Yanan Cao","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07328v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07321v1","updated":"2023-10-11T09:09:55Z","published":"2023-10-11T09:09:55Z","title":"On the Impact of Cross-Domain Data on German Language Models","summary":" Traditionally, large language models have been either trained on general web\ncrawls or domain-specific data. However, recent successes of generative large\nlanguage models, have shed light on the benefits of cross-domain datasets. To\nexamine the significance of prioritizing data diversity over quality, we\npresent a German dataset comprising texts from five domains, along with another\ndataset aimed at containing high-quality data. Through training a series of\nmodels ranging between 122M and 750M parameters on both datasets, we conduct a\ncomprehensive benchmark on multiple downstream tasks. Our findings demonstrate\nthat the models trained on the cross-domain dataset outperform those trained on\nquality data alone, leading to improvements up to $4.45\\%$ over the previous\nstate-of-the-art. The models are available at\nhttps://huggingface.co/ikim-uk-essen\n","authors":["Amin Dada","Aokun Chen","Cheng Peng","Kaleb E Smith","Ahmad Idrissi-Yaghir","Constantin Marc Seibold","Jianning Li","Lars Heiliger","Christoph M. 
Friedrich","Daniel Truhn","Jan Egger","Jiang Bian","Jens Kleesiek","Yonghui Wu"],"pdf_url":"https://arxiv.org/pdf/2310.07321v1.pdf","comment":"13 pages, 1 figure, accepted at Findings of the Association for\n Computational Linguistics: EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07306v1","updated":"2023-10-11T08:40:06Z","published":"2023-10-11T08:40:06Z","title":"SNOiC: Soft Labeling and Noisy Mixup based Open Intent Classification\n Model","summary":" This paper presents a Soft Labeling and Noisy Mixup-based open intent\nclassification model (SNOiC). Most of the previous works have used\nthreshold-based methods to identify open intents, which are prone to\noverfitting and may produce biased predictions. Additionally, the need for more\navailable data for an open intent class presents another limitation for these\nexisting models. SNOiC combines Soft Labeling and Noisy Mixup strategies to\nreduce the biasing and generate pseudo-data for open intent class. The\nexperimental results on four benchmark datasets show that the SNOiC model\nachieves a minimum and maximum performance of 68.72\\% and 94.71\\%,\nrespectively, in identifying open intents. Moreover, compared to\nstate-of-the-art models, the SNOiC model improves the performance of\nidentifying open intents by 0.93\\% (minimum) and 12.76\\% (maximum). The model's\nefficacy is further established by analyzing various parameters used in the\nproposed model. An ablation study is also conducted, which involves creating\nthree model variants to validate the effectiveness of the SNOiC model.\n","authors":["Aditi Kanwar","Aditi Seetha","Satyendra Singh Chouhan","Rajdeep Niyogi"],"pdf_url":"https://arxiv.org/pdf/2310.07306v1.pdf","comment":"9 Pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.07301v1","updated":"2023-10-11T08:36:43Z","published":"2023-10-11T08:36:43Z","title":"Parrot: Enhancing Multi-Turn Chat Models by Learning to Ask Questions","summary":" Impressive progress has been made on chat models based on Large Language\nModels (LLMs) recently; however, there is a noticeable lag in multi-turn\nconversations between open-source chat models (e.g., Alpaca and Vicuna) and the\nleading chat models (e.g., ChatGPT and GPT-4). Through a series of analyses, we\nattribute the lag to the lack of enough high-quality multi-turn\ninstruction-tuning data. The available instruction-tuning data for the\ncommunity are either single-turn conversations or multi-turn ones with certain\nissues, such as non-human-like instructions, less detailed responses, or rare\ntopic shifts. In this paper, we address these challenges by introducing Parrot,\na highly scalable solution designed to automatically generate high-quality\ninstruction-tuning data, which are then used to enhance the effectiveness of\nchat models in multi-turn conversations. Specifically, we start by training the\nParrot-Ask model, which is designed to emulate real users in generating\ninstructions. We then utilize Parrot-Ask to engage in multi-turn conversations\nwith ChatGPT across a diverse range of topics, resulting in a collection of 40K\nhigh-quality multi-turn dialogues (Parrot-40K). These data are subsequently\nemployed to train a chat model that we have named Parrot-Chat. We demonstrate\nthat the dialogues gathered from Parrot-Ask markedly outperform existing\nmulti-turn instruction-following datasets in critical metrics, including topic\ndiversity, number of turns, and resemblance to human conversation. 
With only\n40K training examples, Parrot-Chat achieves strong performance against other\n13B open-source models across a range of instruction-following benchmarks, and\nparticularly excels in evaluations of multi-turn capabilities. We make all\ncodes, datasets, and two versions of the Parrot-Ask model based on LLaMA2-13B\nand KuaiYii-13B available at https://github.com/kwai/KwaiYii/Parrot.\n","authors":["Yuchong Sun","Che Liu","Jinwen Huang","Ruihua Song","Fuzheng Zhang","Di Zhang","Zhongyuan Wang","Kun Gai"],"pdf_url":"https://arxiv.org/pdf/2310.07301v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.07678v2","updated":"2023-10-11T08:34:42Z","published":"2023-03-14T07:27:30Z","title":"Query2doc: Query Expansion with Large Language Models","summary":" This paper introduces a simple yet effective query expansion approach,\ndenoted as query2doc, to improve both sparse and dense retrieval systems. The\nproposed method first generates pseudo-documents by few-shot prompting large\nlanguage models (LLMs), and then expands the query with generated\npseudo-documents. LLMs are trained on web-scale text corpora and are adept at\nknowledge memorization. The pseudo-documents from LLMs often contain highly\nrelevant information that can aid in query disambiguation and guide the\nretrievers. Experimental results demonstrate that query2doc boosts the\nperformance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and\nTREC DL, without any model fine-tuning. Furthermore, our method also benefits\nstate-of-the-art dense retrievers in terms of both in-domain and out-of-domain\nresults.\n","authors":["Liang Wang","Nan Yang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2303.07678v2.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07299v1","updated":"2023-10-11T08:33:23Z","published":"2023-10-11T08:33:23Z","title":"RobustGEC: Robust Grammatical Error Correction Against Subtle Context\n Perturbation","summary":" Grammatical Error Correction (GEC) systems play a vital role in assisting\npeople with their daily writing tasks. However, users may sometimes come across\na GEC system that initially performs well but fails to correct errors when the\ninputs are slightly modified. To ensure an ideal user experience, a reliable\nGEC system should have the ability to provide consistent and accurate\nsuggestions when encountering irrelevant context perturbations, which we refer\nto as context robustness. In this paper, we introduce RobustGEC, a benchmark\ndesigned to evaluate the context robustness of GEC systems. RobustGEC comprises\n5,000 GEC cases, each with one original error-correct sentence pair and five\nvariants carefully devised by human annotators. Utilizing RobustGEC, we reveal\nthat state-of-the-art GEC systems still lack sufficient robustness against\ncontext perturbations. In addition, we propose a simple yet effective method\nfor remitting this issue.\n","authors":["Yue Zhang","Leyang Cui","Enbo Zhao","Wei Bi","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2310.07299v1.pdf","comment":"Accepted to EMNLP 2023 (main conference, long paper)"},{"id":"http://arxiv.org/abs/2310.07289v1","updated":"2023-10-11T08:22:37Z","published":"2023-10-11T08:22:37Z","title":"Beyond Factuality: A Comprehensive Evaluation of Large Language Models\n as Knowledge Generators","summary":" Large language models (LLMs) outperform information retrieval techniques for\ndownstream knowledge-intensive tasks when being prompted to generate world\nknowledge. 
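Query2doc (above) expands a query by prompting an LLM for a pseudo-document and concatenating it with the original query before retrieval. A minimal sketch follows; `generate_pseudo_document` is a placeholder for the few-shot LLM call, and repeating the query so its terms stay dominant for sparse retrieval is an assumption rather than a detail stated in the abstract.

```python
# Sketch of query2doc-style expansion. The LLM call is a hypothetical stub and
# the query-repetition weighting is an assumption made for illustration.

def generate_pseudo_document(query: str) -> str:
    """Placeholder for a few-shot LLM call that writes a short passage answering the query."""
    return f"A short passage that plausibly answers: {query}"

def expand_query(query: str, query_repeats: int = 5) -> str:
    pseudo_doc = generate_pseudo_document(query)
    # Repeat the original query before appending the pseudo-document.
    return " ".join([query] * query_repeats + [pseudo_doc])

if __name__ == "__main__":
    print(expand_query("who wrote the iliad"))
```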
However, community concerns abound regarding the factuality and\npotential implications of using this uncensored knowledge. In light of this, we\nintroduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to\nsystematically and automatically evaluate generated knowledge from six\nimportant perspectives -- Factuality, Relevance, Coherence, Informativeness,\nHelpfulness and Validity. We conduct an extensive empirical analysis of the\ngenerated knowledge from three different types of LLMs on two widely studied\nknowledge-intensive tasks, i.e., open-domain question answering and\nknowledge-grounded dialogue. Surprisingly, our study reveals that the\nfactuality of generated knowledge, even if lower, does not significantly hinder\ndownstream tasks. Instead, the relevance and coherence of the outputs are more\nimportant than small factual mistakes. Further, we show how to use CONNER to\nimprove knowledge-intensive tasks by designing two strategies: Prompt\nEngineering and Knowledge Selection. Our evaluation code and LLM-generated\nknowledge with human annotations will be released to facilitate future\nresearch.\n","authors":["Liang Chen","Yang Deng","Yatao Bian","Zeyu Qin","Bingzhe Wu","Tat-Seng Chua","Kam-Fai Wong"],"pdf_url":"https://arxiv.org/pdf/2310.07289v1.pdf","comment":"Accepted to EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2310.07284v1","updated":"2023-10-11T08:17:54Z","published":"2023-10-11T08:17:54Z","title":"Typing to Listen at the Cocktail Party: Text-Guided Target Speaker\n Extraction","summary":" Humans possess an extraordinary ability to selectively focus on the sound\nsource of interest amidst complex acoustic environments, commonly referred to\nas cocktail party scenarios. In an attempt to replicate this remarkable\nauditory attention capability in machines, target speaker extraction (TSE)\nmodels have been developed. These models leverage the pre-registered cues of\nthe target speaker to extract the sound source of interest. However, the\neffectiveness of these models is hindered in real-world scenarios due to the\npotential variation or even absence of pre-registered cues. To address this\nlimitation, this study investigates the integration of natural language to\nenhance the flexibility and controllability of existing TSE models.\nSpecifically, we propose a model named LLM-TSE, wherein a large language model\n(LLM) to extract useful semantic cues from the user's typed text input, which\ncan complement the pre-registered cues or work independently to control the TSE\nprocess. Our experimental results demonstrate competitive performance when only\ntext-based cues are presented, and a new state-of-the-art is set when combined\nwith pre-registered acoustic cues. To the best of our knowledge, this is the\nfirst work that has successfully incorporated text-based cues to guide target\nspeaker extraction, which can be a cornerstone for cocktail party problem\nresearch.\n","authors":["Xiang Hao","Jibin Wu","Jianwei Yu","Chenglin Xu","Kay Chen Tan"],"pdf_url":"https://arxiv.org/pdf/2310.07284v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.07282v1","updated":"2023-10-11T08:16:35Z","published":"2023-10-11T08:16:35Z","title":"An Analysis on Large Language Models in Healthcare: A Case Study of\n BioBERT","summary":" This paper conducts a comprehensive investigation into applying large\nlanguage models, particularly on BioBERT, in healthcare. 
It begins with\nthoroughly examining previous natural language processing (NLP) approaches in\nhealthcare, shedding light on the limitations and challenges these methods\nface. Following that, this research explores the path that led to the\nincorporation of BioBERT into healthcare applications, highlighting its\nsuitability for addressing the specific requirements of tasks related to\nbiomedical text mining. The analysis outlines a systematic methodology for\nfine-tuning BioBERT to meet the unique needs of the healthcare domain. This\napproach includes various components, including the gathering of data from a\nwide range of healthcare sources, data annotation for tasks like identifying\nmedical entities and categorizing them, and the application of specialized\npreprocessing techniques tailored to handle the complexities found in\nbiomedical texts. Additionally, the paper covers aspects related to model\nevaluation, with a focus on healthcare benchmarks and functions like processing\nof natural language in biomedical, question-answering, clinical document\nclassification, and medical entity recognition. It explores techniques to\nimprove the model's interpretability and validates its performance compared to\nexisting healthcare-focused language models. The paper thoroughly examines\nethical considerations, particularly patient privacy and data security. It\nhighlights the benefits of incorporating BioBERT into healthcare contexts,\nincluding enhanced clinical decision support and more efficient information\nretrieval. Nevertheless, it acknowledges the impediments and complexities of\nthis integration, encompassing concerns regarding data privacy, transparency,\nresource-intensive requirements, and the necessity for model customization to\nalign with diverse healthcare domains.\n","authors":["Shyni Sharaf","V. S. Anoop"],"pdf_url":"https://arxiv.org/pdf/2310.07282v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.03347v4","updated":"2023-10-11T08:13:28Z","published":"2023-04-06T19:53:59Z","title":"Towards Interpretable Mental Health Analysis with Large Language Models","summary":" The latest large language models (LLMs) such as ChatGPT, exhibit strong\ncapabilities in automated mental health analysis. However, existing relevant\nstudies bear several limitations, including inadequate evaluations, lack of\nprompting strategies, and ignorance of exploring LLMs for explainability. To\nbridge these gaps, we comprehensively evaluate the mental health analysis and\nemotional reasoning ability of LLMs on 11 datasets across 5 tasks. We explore\nthe effects of different prompting strategies with unsupervised and distantly\nsupervised emotional information. Based on these prompts, we explore LLMs for\ninterpretable mental health analysis by instructing them to generate\nexplanations for each of their decisions. We convey strict human evaluations to\nassess the quality of the generated explanations, leading to a novel dataset\nwith 163 human-assessed explanations. We benchmark existing automatic\nevaluation metrics on this dataset to guide future related works. According to\nthe results, ChatGPT shows strong in-context learning ability but still has a\nsignificant gap with advanced task-specific methods. Careful prompt engineering\nwith emotional cues and expert-written few-shot examples can also effectively\nimprove performance on mental health analysis. 
In addition, ChatGPT generates\nexplanations that approach human performance, showing its great potential in\nexplainable mental health analysis.\n","authors":["Kailai Yang","Shaoxiong Ji","Tianlin Zhang","Qianqian Xie","Ziyan Kuang","Sophia Ananiadou"],"pdf_url":"https://arxiv.org/pdf/2304.03347v4.pdf","comment":"Accepted by EMNLP 2023 main conference as a long paper"},{"id":"http://arxiv.org/abs/2310.07279v1","updated":"2023-10-11T08:07:22Z","published":"2023-10-11T08:07:22Z","title":"Enhancing expressivity transfer in textless speech-to-speech translation","summary":" Textless speech-to-speech translation systems are rapidly advancing, thanks\nto the integration of self-supervised learning techniques. However, existing\nstate-of-the-art systems fall short when it comes to capturing and transferring\nexpressivity accurately across different languages. Expressivity plays a vital\nrole in conveying emotions, nuances, and cultural subtleties, thereby enhancing\ncommunication across diverse languages. To address this issue this study\npresents a novel method that operates at the discrete speech unit level and\nleverages multilingual emotion embeddings to capture language-agnostic\ninformation. Specifically, we demonstrate how these embeddings can be used to\neffectively predict the pitch and duration of speech units in the target\nlanguage. Through objective and subjective experiments conducted on a\nFrench-to-English translation task, our findings highlight the superior\nexpressivity transfer achieved by our approach compared to current\nstate-of-the-art systems.\n","authors":["Jarod Duret","Benjamin O'Brien","Yannick Estève","Titouan Parcollet"],"pdf_url":"https://arxiv.org/pdf/2310.07279v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07276v1","updated":"2023-10-11T07:57:08Z","published":"2023-10-11T07:57:08Z","title":"BioT5: Enriching Cross-modal Integration in Biology with Chemical\n Knowledge and Natural Language Associations","summary":" Recent advancements in biological research leverage the integration of\nmolecules, proteins, and natural language to enhance drug discovery. However,\ncurrent models exhibit several limitations, such as the generation of invalid\nmolecular SMILES, underutilization of contextual information, and equal\ntreatment of structured and unstructured knowledge. To address these issues, we\npropose $\\mathbf{BioT5}$, a comprehensive pre-training framework that enriches\ncross-modal integration in biology with chemical knowledge and natural language\nassociations. $\\mathbf{BioT5}$ utilizes SELFIES for $100%$ robust molecular\nrepresentations and extracts knowledge from the surrounding context of\nbio-entities in unstructured biological literature. Furthermore,\n$\\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge,\nleading to more effective utilization of information. After fine-tuning, BioT5\nshows superior performance across a wide range of tasks, demonstrating its\nstrong capability of capturing underlying relations and properties of\nbio-entities. 
Our code is available at\n$\\href{https://github.com/QizhiPei/BioT5}{Github}$.\n","authors":["Qizhi Pei","Wei Zhang","Jinhua Zhu","Kehan Wu","Kaiyuan Gao","Lijun Wu","Yingce Xia","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.07276v1.pdf","comment":"Empirical Methods in Natural Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2310.07251v1","updated":"2023-10-11T07:27:34Z","published":"2023-10-11T07:27:34Z","title":"Ethical Reasoning over Moral Alignment: A Case and Framework for\n In-Context Ethical Policies in LLMs","summary":" In this position paper, we argue that instead of morally aligning LLMs to\nspecific set of ethical principles, we should infuse generic ethical reasoning\ncapabilities into them so that they can handle value pluralism at a global\nscale. When provided with an ethical policy, an LLM should be capable of making\ndecisions that are ethically consistent to the policy. We develop a framework\nthat integrates moral dilemmas with moral principles pertaining to different\nforamlisms of normative ethics, and at different levels of abstractions.\nInitial experiments with GPT-x models shows that while GPT-4 is a nearly\nperfect ethical reasoner, the models still have bias towards the moral values\nof Western and English speaking societies.\n","authors":["Abhinav Rao","Aditi Khandelwal","Kumar Tanmay","Utkarsh Agarwal","Monojit Choudhury"],"pdf_url":"https://arxiv.org/pdf/2310.07251v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06450v2","updated":"2023-10-11T07:04:04Z","published":"2023-10-10T09:20:14Z","title":"Constructive Large Language Models Alignment with Diverse Feedback","summary":" In recent research on large language models (LLMs), there has been a growing\nemphasis on aligning these models with human values to reduce the impact of\nharmful content. However, current alignment methods often rely solely on\nsingular forms of human feedback, such as preferences, annotated labels, or\nnatural language critiques, overlooking the potential advantages of combining\nthese feedback types. This limitation leads to suboptimal performance, even\nwhen ample training data is available. In this paper, we introduce Constructive\nand Diverse Feedback (CDF) as a novel method to enhance LLM alignment, inspired\nby constructivist learning theory. Our approach involves collecting three\ndistinct types of feedback tailored to problems of varying difficulty levels\nwithin the training dataset. Specifically, we exploit critique feedback for\neasy problems, refinement feedback for medium problems, and preference feedback\nfor hard problems. By training our model with this diversified feedback, we\nachieve enhanced alignment performance while using less training data. To\nassess the effectiveness of CDF, we evaluate it against previous methods in\nthree downstream tasks: question answering, dialog generation, and text\nsummarization. 
Experimental results demonstrate that CDF achieves superior\nperformance even with a smaller training dataset.\n","authors":["Tianshu Yu","Ting-En Lin","Yuchuan Wu","Min Yang","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2310.06450v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07225v1","updated":"2023-10-11T06:26:19Z","published":"2023-10-11T06:26:19Z","title":"Exploring the Landscape of Large Language Models In Medical Question\n Answering: Observations and Open Questions","summary":" Large Language Models (LLMs) have shown promise in medical question answering\nby achieving passing scores in standardised exams and have been suggested as\ntools for supporting healthcare workers. Deploying LLMs into such a high-risk\ncontext requires a clear understanding of the limitations of these models. With\nthe rapid development and release of new LLMs, it is especially valuable to\nidentify patterns which exist across models and may, therefore, continue to\nappear in newer versions. In this paper, we evaluate a wide range of popular\nLLMs on their knowledge of medical questions in order to better understand\ntheir properties as a group. From this comparison, we provide preliminary\nobservations and raise open questions for further research.\n","authors":["Karolina Korgul","Andrew M. Bean","Felix Krones","Robert McCraith","Adam Mahdi"],"pdf_url":"https://arxiv.org/pdf/2310.07225v1.pdf","comment":"11 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.05028v3","updated":"2023-10-11T06:16:30Z","published":"2023-10-08T06:17:39Z","title":"Revisiting Large Language Models as Zero-shot Relation Extractors","summary":" Relation extraction (RE) consistently involves a certain degree of labeled or\nunlabeled data even if under zero-shot setting. Recent studies have shown that\nlarge language models (LLMs) transfer well to new tasks out-of-the-box simply\ngiven a natural language prompt, which provides the possibility of extracting\nrelations from text without any data and parameter tuning. This work focuses on\nthe study of exploring LLMs, such as ChatGPT, as zero-shot relation extractors.\nOn the one hand, we analyze the drawbacks of existing RE prompts and attempt to\nincorporate recent prompt techniques such as chain-of-thought (CoT) to improve\nzero-shot RE. We propose the summarize-and-ask (\\textsc{SumAsk}) prompting, a\nsimple prompt recursively using LLMs to transform RE inputs to the effective\nquestion answering (QA) format. On the other hand, we conduct comprehensive\nexperiments on various benchmarks and settings to investigate the capabilities\nof LLMs on zero-shot RE. Specifically, we have the following findings: (i)\n\\textsc{SumAsk} consistently and significantly improves LLMs performance on\ndifferent model sizes, benchmarks and settings; (ii) Zero-shot prompting with\nChatGPT achieves competitive or superior results compared with zero-shot and\nfully supervised methods; (iii) LLMs deliver promising performance in\nextracting overlapping relations; (iv) The performance varies greatly regarding\ndifferent relations. 
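The zero-shot relation extraction abstract above introduces summarize-and-ask (SumAsk) prompting, which uses the LLM to recast RE inputs into a QA format. One plausible, simplified reading is sketched below: summarize the sentence around the entity pair, then ask a yes/no question per candidate relation. The prompt wording and the `call_llm` stub are assumptions; the paper's recursive procedure may differ.

```python
# Simplified SumAsk-style sketch for zero-shot relation extraction.
# `call_llm` is a toy stand-in for a real LLM; prompts are illustrative.

from typing import List, Optional

def call_llm(prompt: str) -> str:
    """Hypothetical LLM interface returning plain text (deterministic toy stub)."""
    if prompt.startswith("Summarize"):
        return "Jobs founded Apple."
    return "yes" if "'founded'" in prompt else "no"

def sum_ask(sentence: str, head: str, tail: str, relations: List[str]) -> Optional[str]:
    # Step 1: summarize the sentence with respect to the entity pair.
    summary = call_llm(
        f"Summarize the following sentence, focusing on '{head}' and '{tail}':\n{sentence}"
    )
    # Step 2: ask a yes/no question for each candidate relation.
    for relation in relations:
        answer = call_llm(
            f"Summary: {summary}\nQuestion: does '{head}' have the relation "
            f"'{relation}' to '{tail}'? Answer yes or no."
        )
        if answer.strip().lower().startswith("yes"):
            return relation
    return None  # none-of-the-above

if __name__ == "__main__":
    print(sum_ask("Jobs founded Apple in 1976.", "Jobs", "Apple",
                  ["founded", "born_in", "works_for"]))
```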
Different from small language models, LLMs are effective\nin handling challenge none-of-the-above (NoTA) relation.\n","authors":["Guozheng Li","Peng Wang","Wenjun Ke"],"pdf_url":"https://arxiv.org/pdf/2310.05028v3.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2303.08518v3","updated":"2023-10-11T05:40:41Z","published":"2023-03-15T10:53:49Z","title":"UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation","summary":" Large Language Models (LLMs) are popular for their impressive abilities, but\nthe need for model-specific fine-tuning or task-specific prompt engineering can\nhinder their generalization. We propose UPRISE (Universal Prompt Retrieval for\nImproving zero-Shot Evaluation), which tunes a lightweight and versatile\nretriever that automatically retrieves prompts for a given zero-shot task\ninput. Specifically, we demonstrate universality in a cross-task and\ncross-model scenario: the retriever is tuned on a diverse set of tasks, but\ntested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for\ntuning the retriever, but test the retriever on different LLMs of much larger\nscales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that\nUPRISE mitigates the hallucination problem in our experiments with ChatGPT,\nsuggesting its potential to improve even the strongest LLMs. Our model and code\nare available at https://github.com/microsoft/LMOps.\n","authors":["Daixuan Cheng","Shaohan Huang","Junyu Bi","Yuefeng Zhan","Jianfeng Liu","Yujing Wang","Hao Sun","Furu Wei","Denvy Deng","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2303.08518v3.pdf","comment":"EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2305.16340v2","updated":"2023-10-11T05:32:13Z","published":"2023-05-24T03:47:22Z","title":"Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model","summary":" Transformers have shown dominant performance across a range of domains\nincluding language and vision. However, their computational cost grows\nquadratically with the sequence length, making their usage prohibitive for\nresource-constrained applications. To counter this, our approach is to divide\nthe whole sequence into segments and use local attention mechanism on the\nindividual segments. We propose a segmented recurrent transformer (SRformer)\nthat combines segmented (local) attention with recurrent attention. The loss\ncaused by reducing the attention window length is compensated by aggregating\ninformation across segments with recurrent attention. SRformer leverages\nRecurrent Accumulate-and-Fire (RAF) neurons' inherent memory to update the\ncumulative product of keys and values. The segmented attention and lightweight\nRAF neurons ensure the efficiency of the proposed transformer. Such an approach\nleads to models with sequential processing capability at a lower\ncomputation/memory cost. We apply the proposed method to T5 and BART\ntransformers. The modified models are tested on summarization datasets\nincluding CNN-dailymail, XSUM, ArXiv, and MediaSUM. Notably, using segmented\ninputs of varied sizes, the proposed model achieves $6-22\\%$ higher ROUGE1\nscores than a segmented transformer and outperforms other recurrent transformer\napproaches. 
Furthermore, compared to full attention, the proposed model reduces\nthe computational complexity of cross attention by around $40\\%$.\n","authors":["Yinghan Long","Sayeed Shafayet Chowdhury","Kaushik Roy"],"pdf_url":"https://arxiv.org/pdf/2305.16340v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14251v2","updated":"2023-10-11T05:27:50Z","published":"2023-05-23T17:06:00Z","title":"FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long\n Form Text Generation","summary":" Evaluating the factuality of long-form text generated by large language\nmodels (LMs) is non-trivial because (1) generations often contain a mixture of\nsupported and unsupported pieces of information, making binary judgments of\nquality inadequate, and (2) human evaluation is time-consuming and costly. In\nthis paper, we introduce FACTSCORE, a new evaluation that breaks a generation\ninto a series of atomic facts and computes the percentage of atomic facts\nsupported by a reliable knowledge source. We conduct an extensive human\nevaluation to obtain FACTSCOREs of people biographies generated by several\nstate-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the\nretrieval-augmented PerplexityAI -- and report new analysis demonstrating the\nneed for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since\nhuman evaluation is costly, we also introduce an automated model that estimates\nFACTSCORE using retrieval and a strong language model, with less than a 2%\nerror rate. Finally, we use this automated metric to evaluate 6,500 generations\nfrom a new set of 13 recent LMs that would have cost $26K if evaluated by\nhumans, with various findings: GPT-4 and ChatGPT are more factual than public\nmodels, and Vicuna and Alpaca are some of the best public models. FACTSCORE is\navailable for public use via `pip install factscore`.\n","authors":["Sewon Min","Kalpesh Krishna","Xinxi Lyu","Mike Lewis","Wen-tau Yih","Pang Wei Koh","Mohit Iyyer","Luke Zettlemoyer","Hannaneh Hajishirzi"],"pdf_url":"https://arxiv.org/pdf/2305.14251v2.pdf","comment":"25 pages; 7 figures. Published as a main conference paper at EMNLP\n 2023. Code available at https://github.com/shmsw25/FActScore"},{"id":"http://arxiv.org/abs/2204.07994v2","updated":"2023-10-11T05:12:10Z","published":"2022-04-17T12:33:34Z","title":"Knowledgeable Salient Span Mask for Enhancing Language Models as\n Knowledge Base","summary":" Pre-trained language models (PLMs) like BERT have made significant progress\nin various downstream NLP tasks. However, by asking models to do cloze-style\ntests, recent work finds that PLMs are short in acquiring knowledge from\nunstructured text. To understand the internal behaviour of PLMs in retrieving\nknowledge, we first define knowledge-baring (K-B) tokens and knowledge-free\n(K-F) tokens for unstructured text and ask professional annotators to label\nsome samples manually. Then, we find that PLMs are more likely to give wrong\npredictions on K-B tokens and attend less attention to those tokens inside the\nself-attention module. Based on these observations, we develop two solutions to\nhelp the model learn more knowledge from unstructured text in a fully\nself-supervised manner. Experiments on knowledge-intensive tasks show the\neffectiveness of the proposed methods. 
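FActScore (described above) scores a long-form generation as the fraction of its atomic facts supported by a reliable knowledge source. The sketch below shows that final aggregation step only; the fact splitting and the support checker are toy stand-ins, and the released `factscore` package implements the actual pipeline.

```python
# Sketch of the FActScore aggregation: fraction of atomic facts judged as
# supported. The fact list and the support lookup are toy examples.

from typing import Callable, List

def factscore(atomic_facts: List[str],
              is_supported: Callable[[str], bool]) -> float:
    if not atomic_facts:
        return 0.0
    return sum(is_supported(f) for f in atomic_facts) / len(atomic_facts)

if __name__ == "__main__":
    facts = [
        "Marie Curie was born in Warsaw.",
        "Marie Curie won two Nobel Prizes.",
        "Marie Curie invented the telephone.",
    ]
    knowledge_base = {
        "Marie Curie was born in Warsaw.": True,
        "Marie Curie won two Nobel Prizes.": True,
        "Marie Curie invented the telephone.": False,
    }
    print(factscore(facts, lambda f: knowledge_base.get(f, False)))  # ~0.67
```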
To our best knowledge, we are the first\nto explore fully self-supervised learning of knowledge in continual\npre-training.\n","authors":["Cunxiang Wang","Fuli Luo","Yanyang Li","Runxin Xu","Fei Huang","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2204.07994v2.pdf","comment":"NLPCC-2023"},{"id":"http://arxiv.org/abs/2309.17421v2","updated":"2023-10-11T05:07:37Z","published":"2023-09-29T17:34:51Z","title":"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)","summary":" Large multimodal models (LMMs) extend large language models (LLMs) with\nmulti-sensory skills, such as visual understanding, to achieve stronger generic\nintelligence. In this paper, we analyze the latest model, GPT-4V(ision), to\ndeepen the understanding of LMMs. The analysis focuses on the intriguing tasks\nthat GPT-4V can perform, containing test samples to probe the quality and\ngenericity of GPT-4V's capabilities, its supported inputs and working modes,\nand the effective ways to prompt the model. In our approach to exploring\nGPT-4V, we curate and organize a collection of carefully designed qualitative\nsamples spanning a variety of domains and tasks. Observations from these\nsamples demonstrate that GPT-4V's unprecedented ability in processing\narbitrarily interleaved multimodal inputs and the genericity of its\ncapabilities together make GPT-4V a powerful multimodal generalist system.\nFurthermore, GPT-4V's unique capability of understanding visual markers drawn\non input images can give rise to new human-computer interaction methods such as\nvisual referring prompting. We conclude the report with in-depth discussions on\nthe emerging application scenarios and the future research directions for\nGPT-4V-based systems. We hope that this preliminary exploration will inspire\nfuture research on the next-generation multimodal task formulation, new ways to\nexploit and enhance LMMs to solve real-world problems, and gaining better\nunderstanding of multimodal foundation models. Finally, we acknowledge that the\nmodel under our study is solely the product of OpenAI's innovative work, and\nthey should be fully credited for its development. Please see the GPT-4V\ncontributions paper for the authorship and credit attribution:\nhttps://cdn.openai.com/contributions/gpt-4v.pdf\n","authors":["Zhengyuan Yang","Linjie Li","Kevin Lin","Jianfeng Wang","Chung-Ching Lin","Zicheng Liu","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2309.17421v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06200v2","updated":"2023-10-11T05:00:10Z","published":"2023-10-09T23:02:07Z","title":"The Importance of Prompt Tuning for Automated Neuron Explanations","summary":" Recent advances have greatly increased the capabilities of large language\nmodels (LLMs), but our understanding of the models and their safety has not\nprogressed as fast. In this paper we aim to understand LLMs deeper by studying\ntheir individual neurons. We build upon previous work showing large language\nmodels such as GPT-4 can be useful in explaining what each neuron in a language\nmodel does. Specifically, we analyze the effect of the prompt used to generate\nexplanations and show that reformatting the explanation prompt in a more\nnatural way can significantly improve neuron explanation quality and greatly\nreduce computational cost. 
We demonstrate the effects of our new prompts in\nthree different ways, incorporating both automated and human evaluations.\n","authors":["Justin Lee","Tuomas Oikarinen","Arjun Chatha","Keng-Chi Chang","Yilan Chen","Tsui-Wei Weng"],"pdf_url":"https://arxiv.org/pdf/2310.06200v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07188v1","updated":"2023-10-11T04:30:18Z","published":"2023-10-11T04:30:18Z","title":"Adaptive Gating in Mixture-of-Experts based Language Models","summary":" Large language models, such as OpenAI's ChatGPT, have demonstrated\nexceptional language understanding capabilities in various NLP tasks. Sparsely\nactivated mixture-of-experts (MoE) has emerged as a promising solution for\nscaling models while maintaining a constant number of computational operations.\nExisting MoE model adopts a fixed gating network where each token is computed\nby the same number of experts. However, this approach contradicts our intuition\nthat the tokens in each sequence vary in terms of their linguistic complexity\nand, consequently, require different computational costs. Little is discussed\nin prior research on the trade-off between computation per token and model\nperformance. This paper introduces adaptive gating in MoE, a flexible training\nstrategy that allows tokens to be processed by a variable number of experts\nbased on expert probability distribution. The proposed framework preserves\nsparsity while improving training efficiency. Additionally, curriculum learning\nis leveraged to further reduce training time. Extensive experiments on diverse\nNLP tasks show that adaptive gating reduces at most 22.5% training time while\nmaintaining inference quality. Moreover, we conduct a comprehensive analysis of\nthe routing decisions and present our insights when adaptive gating is used.\n","authors":["Jiamin Li","Qiang Su","Yitao Yang","Yimin Jiang","Cong Wang","Hong Xu"],"pdf_url":"https://arxiv.org/pdf/2310.07188v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.12920v4","updated":"2023-10-11T04:08:26Z","published":"2023-01-30T14:19:29Z","title":"Active Learning for Multilingual Semantic Parser","summary":" Current multilingual semantic parsing (MSP) datasets are almost all collected\nby translating the utterances in the existing datasets from the resource-rich\nlanguage to the target language. However, manual translation is costly. To\nreduce the translation effort, this paper proposes the first active learning\nprocedure for MSP (AL-MSP). AL-MSP selects only a subset from the existing\ndatasets to be translated. We also propose a novel selection method that\nprioritizes the examples diversifying the logical form structures with more\nlexical choices, and a novel hyperparameter tuning method that needs no extra\nannotation cost. Our experiments show that AL-MSP significantly reduces\ntranslation costs with ideal selection methods. Our selection method with\nproper hyperparameters yields better parsing performance than the other\nbaselines on two multilingual datasets.\n","authors":["Zhuang Li","Gholamreza Haffari"],"pdf_url":"https://arxiv.org/pdf/2301.12920v4.pdf","comment":"EACL 2023 (findings)"},{"id":"http://arxiv.org/abs/2310.07177v1","updated":"2023-10-11T04:03:42Z","published":"2023-10-11T04:03:42Z","title":"Online Speculative Decoding","summary":" Speculative decoding is a pivotal technique to accelerate the inference of\nlarge language models (LLMs) by employing a smaller draft model to predict the\ntarget model's outputs. 
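The adaptive-gating MoE abstract above routes each token to a variable number of experts based on the gate's probability distribution. The sketch below assumes one plausible rule, selecting experts in descending probability until a cumulative threshold is reached; the threshold and cap are illustrative and the paper's exact criterion may differ.

```python
# Sketch: per-token adaptive expert selection under an assumed
# cumulative-probability rule (illustrative, not the paper's exact method).

from typing import List

def adaptive_expert_selection(gate_probs: List[float],
                              threshold: float = 0.8,
                              max_experts: int = 4) -> List[int]:
    """Pick experts in descending gate probability until the cumulative
    probability reaches `threshold` or `max_experts` is hit."""
    order = sorted(range(len(gate_probs)), key=lambda i: gate_probs[i], reverse=True)
    selected, cumulative = [], 0.0
    for idx in order:
        selected.append(idx)
        cumulative += gate_probs[idx]
        if cumulative >= threshold or len(selected) == max_experts:
            break
    return selected

if __name__ == "__main__":
    easy_token = [0.9, 0.05, 0.03, 0.02]   # one expert suffices
    hard_token = [0.3, 0.3, 0.25, 0.15]    # needs several experts
    print(adaptive_expert_selection(easy_token))
    print(adaptive_expert_selection(hard_token))
```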
However, its efficacy can be limited due to the low\npredictive accuracy of the draft model, particularly when faced with diverse\ntext inputs and a significant capability gap between the draft and target\nmodels. We introduce online speculative decoding (OSD) to address this\nchallenge. The main idea is to continually update (multiple) draft model(s) on\nobserved user query data using the abundant excess computational power in an\nLLM serving cluster. Given that LLM inference is memory-bounded, the surplus\ncomputational power in a typical LLM serving cluster can be repurposed for\nonline retraining of draft models, thereby making the training cost-neutral.\nSince the query distribution of an LLM service is relatively simple, retraining\non query distribution enables the draft model to more accurately predict the\ntarget model's outputs, particularly on data originating from query\ndistributions. As the draft model evolves online, it aligns with the query\ndistribution in real time, mitigating distribution shifts. We develop a\nprototype of online speculative decoding based on online knowledge distillation\nand evaluate it using both synthetic and real query data on several popular\nLLMs. The results show a substantial increase in the token acceptance rate by\n0.1 to 0.65, which translates into 1.22x to 3.06x latency reduction.\n","authors":["Xiaoxuan Liu","Lanxiang Hu","Peter Bailis","Ion Stoica","Zhijie Deng","Alvin Cheung","Hao Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07177v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07170v1","updated":"2023-10-11T03:39:46Z","published":"2023-10-11T03:39:46Z","title":"PHALM: Building a Knowledge Graph from Scratch by Prompting Humans and a\n Language Model","summary":" Despite the remarkable progress in natural language understanding with\npretrained Transformers, neural language models often do not handle commonsense\nknowledge well. Toward commonsense-aware models, there have been attempts to\nobtain knowledge, ranging from automatic acquisition to crowdsourcing. However,\nit is difficult to obtain a high-quality knowledge base at a low cost,\nespecially from scratch. In this paper, we propose PHALM, a method of building\na knowledge graph from scratch, by prompting both crowdworkers and a large\nlanguage model (LLM). We used this method to build a Japanese event knowledge\ngraph and trained Japanese commonsense generation models. Experimental results\nrevealed the acceptability of the built graph and inferences generated by the\ntrained models. We also report the difference in prompting humans and an LLM.\nOur code, data, and models are available at\ngithub.com/nlp-waseda/comet-atomic-ja.\n","authors":["Tatsuya Ide","Eiki Murata","Daisuke Kawahara","Takato Yamazaki","Shengzhe Li","Kenta Shinzato","Toshinori Sato"],"pdf_url":"https://arxiv.org/pdf/2310.07170v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15074v2","updated":"2023-10-11T03:38:56Z","published":"2023-05-24T11:55:59Z","title":"Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For\n Large Language Models","summary":" The performance of large language models (LLMs) on existing reasoning\nbenchmarks has significantly improved over the past years. In response, we\npresent JEEBench, a considerably more challenging benchmark dataset for\nevaluating the problem solving abilities of LLMs. We curate 515 challenging\npre-engineering mathematics, physics and chemistry problems from the highly\ncompetitive IIT JEE-Advanced exam. 
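The online speculative decoding abstract above builds on the standard draft-then-verify loop, where a small draft model proposes tokens and the target model accepts a prefix of them. The sketch below shows a simplified greedy-verification step to make the accepted-token idea concrete; the probabilistic accept/reject rule and the online distillation of the draft model are omitted, and all function names are assumptions.

```python
# Sketch: one simplified speculative-decoding step with greedy verification.
# Accept draft tokens while they match the target model's greedy choice,
# then append one token from the target model.

from typing import Callable, List

def speculative_step(prefix: List[str],
                     draft_propose: Callable[[List[str], int], List[str]],
                     target_next: Callable[[List[str]], str],
                     k: int = 4) -> List[str]:
    draft = draft_propose(prefix, k)
    accepted: List[str] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token == expected:
            accepted.append(token)      # draft token verified
        else:
            accepted.append(expected)   # replace with the target's token and stop
            break
    else:
        accepted.append(target_next(prefix + accepted))  # bonus token when all match
    return prefix + accepted

if __name__ == "__main__":
    target_text = "the cat sat on the mat".split()
    target_next = lambda ctx: target_text[len(ctx)] if len(ctx) < len(target_text) else "<eos>"
    draft_propose = lambda ctx, k: ["the", "cat", "sat", "under"][:k]  # toy draft with one error
    print(speculative_step([], draft_propose, target_next))
```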
Long-horizon reasoning on top of deep\nin-domain knowledge is essential for solving problems in this benchmark. Our\nevaluation on various open-source and proprietary models reveals that the\nhighest performance, even after using techniques like self-consistency,\nself-refinement and chain-of-thought prompting, is less than 40\\%. The typical\nfailure modes of GPT-4, the best model, are errors in algebraic manipulation,\ndifficulty in grounding abstract concepts into mathematical equations\naccurately and failure in retrieving relevant domain-specific concepts. We also\nobserve that by mere prompting, GPT-4 is unable to assess risk introduced by\nnegative marking for incorrect answers. For this, we develop a post-hoc\nconfidence-thresholding method over self-consistency, which enables effective\nresponse selection. We hope that our challenging benchmark will guide future\nre-search in problem-solving using LLMs.\n","authors":["Daman Arora","Himanshu Gaurav Singh"," Mausam"],"pdf_url":"https://arxiv.org/pdf/2305.15074v2.pdf","comment":"v2"},{"id":"http://arxiv.org/abs/2310.07161v1","updated":"2023-10-11T03:19:22Z","published":"2023-10-11T03:19:22Z","title":"Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms","summary":" Within the ambit of VoIP (Voice over Internet Protocol) telecommunications,\nthe complexities introduced by acoustic transformations merit rigorous\nanalysis. This research, rooted in the exploration of proprietary sender-side\ndenoising effects, meticulously evaluates platforms such as Google Meets and\nZoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset,\nensuring a structured examination tailored to various denoising settings and\nreceiver interfaces. A methodological novelty is introduced via the Oaxaca\ndecomposition, traditionally an econometric tool, repurposed herein to analyze\nacoustic-phonetic perturbations within VoIP systems. To further ground the\nimplications of these transformations, psychoacoustic metrics, specifically\nPESQ and STOI, were harnessed to furnish a comprehensive understanding of\nspeech alterations. Cumulatively, the insights garnered underscore the\nintricate landscape of VoIP-influenced acoustic dynamics. In addition to the\nprimary findings, a multitude of metrics are reported, extending the research\npurview. Moreover, out-of-domain benchmarking for both time and time-frequency\ndomain speech enhancement models is included, thereby enhancing the depth and\napplicability of this inquiry.\n","authors":["Joseph Konan","Ojas Bhargave","Shikhar Agnihotri","Shuo Han","Yunyang Zeng","Ankit Shah","Bhiksha Raj"],"pdf_url":"https://arxiv.org/pdf/2310.07161v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02796v2","updated":"2023-10-11T03:16:23Z","published":"2023-07-06T06:11:51Z","title":"VerifAI: Verified Generative AI","summary":" Generative AI has made significant strides, yet concerns about the accuracy\nand reliability of its outputs continue to grow. Such inaccuracies can have\nserious consequences such as inaccurate decision-making, the spread of false\ninformation, privacy violations, legal liabilities, and more. Although efforts\nto address these risks are underway, including explainable AI and responsible\nAI practices such as transparency, privacy protection, bias mitigation, and\nsocial and environmental responsibility, misinformation caused by generative AI\nwill remain a significant challenge. 
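The JEEBench abstract above proposes post-hoc confidence thresholding over self-consistency so the model can abstain when negative marking makes a low-confidence guess costly. A minimal sketch follows: take the majority answer across samples and return it only if its vote share clears a threshold; the threshold value is an assumption.

```python
# Sketch: majority vote over sampled answers with abstention below a
# confidence threshold (threshold value is illustrative).

from collections import Counter
from typing import List, Optional

def thresholded_self_consistency(samples: List[str],
                                 threshold: float = 0.5) -> Optional[str]:
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    return answer if votes / len(samples) >= threshold else None  # None = abstain

if __name__ == "__main__":
    print(thresholded_self_consistency(["B", "B", "B", "C", "B"]))  # confident: "B"
    print(thresholded_self_consistency(["A", "B", "C", "D", "B"]))  # abstain: None
```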
We propose that verifying the outputs of\ngenerative AI from a data management perspective is an emerging issue for\ngenerative AI. This involves analyzing the underlying data from multi-modal\ndata lakes, including text files, tables, and knowledge graphs, and assessing\nits quality and consistency. By doing so, we can establish a stronger\nfoundation for evaluating the outputs of generative AI models. Such an approach\ncan ensure the correctness of generative AI, promote transparency, and enable\ndecision-making with greater confidence. Our vision is to promote the\ndevelopment of verifiable generative AI and contribute to a more trustworthy\nand responsible use of AI.\n","authors":["Nan Tang","Chenyu Yang","Ju Fan","Lei Cao","Yuyu Luo","Alon Halevy"],"pdf_url":"https://arxiv.org/pdf/2307.02796v2.pdf","comment":"8 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.07155v1","updated":"2023-10-11T03:01:42Z","published":"2023-10-11T03:01:42Z","title":"\"A Tale of Two Movements\": Identifying and Comparing Perspectives in\n #BlackLivesMatter and #BlueLivesMatter Movements-related Tweets using Weakly\n Supervised Graph-based Structured Prediction","summary":" Social media has become a major driver of social change, by facilitating the\nformation of online social movements. Automatically understanding the\nperspectives driving the movement and the voices opposing it is a challenging\ntask as annotated data is difficult to obtain. We propose a weakly supervised\ngraph-based approach that explicitly models perspectives in\n#BlackLivesMatter-related tweets. Our proposed approach utilizes a\nsocial-linguistic representation of the data. We convert the text to a graph by\nbreaking it into structured elements and connect it with the social network of\nauthors; then structured prediction is done over the elements for identifying\nperspectives. Our approach uses a small seed set of labeled examples. We\nexperiment with large language models for generating artificial training\nexamples, compare them to manual annotation, and find that it achieves\ncomparable performance. We perform quantitative and qualitative analyses using\na human-annotated test set. Our model outperforms multitask baselines by a\nlarge margin, successfully characterizing the perspectives supporting and\nopposing #BLM.\n","authors":["Shamik Roy","Dan Goldwasser"],"pdf_url":"https://arxiv.org/pdf/2310.07155v1.pdf","comment":"Accepted version to Findings of EMNLP 2023 (camera ready coming soon)"},{"id":"http://arxiv.org/abs/2310.07147v1","updated":"2023-10-11T02:47:40Z","published":"2023-10-11T02:47:40Z","title":"QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources","summary":" Large Language Models (LLMs) have showcased remarkable impacts across a wide\nspectrum of natural language processing tasks. Fine-tuning these pre-trained\nmodels on downstream datasets provides further significant performance gains,\nbut this process has been challenging due to its extraordinary resource\nrequirements. To this end, existing efforts focus on parameter-efficient\nfine-tuning, which, unfortunately, fail to capitalize on the powerful potential\nof full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized\nFull-parameter Tuning framework for LLMs that enables memory-efficient\nfine-tuning without harming performance. 
Our framework incorporates two novel\nideas: (i) we adopt the efficient Lion optimizer, which only keeps track of the\nmomentum and has consistent update magnitudes for each parameter, an inherent\nadvantage for robust quantization; and (ii) we quantize all model states and\nstore them as integer values, and present a gradient flow and parameter update\nscheme for the quantized weights. As a result, QFT reduces the model state\nmemory to 21% of the standard solution while achieving comparable performance,\ne.g., tuning a LLaMA-7B model requires only <30GB of memory, satisfied by a\nsingle A6000 GPU.\n","authors":["Zhikai Li","Xiaoxuan Liu","Banghua Zhu","Zhen Dong","Qingyi Gu","Kurt Keutzer"],"pdf_url":"https://arxiv.org/pdf/2310.07147v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07146v1","updated":"2023-10-11T02:47:21Z","published":"2023-10-11T02:47:21Z","title":"Empowering Psychotherapy with Large Language Models: Cognitive\n Distortion Detection through Diagnosis of Thought Prompting","summary":" Mental illness remains one of the most critical public health issues of our\ntime, due to the severe scarcity and accessibility limit of professionals.\nPsychotherapy requires high-level expertise to conduct deep, complex reasoning\nand analysis on the cognition modeling of the patients. In the era of Large\nLanguage Models, we believe it is the right time to develop AI assistance for\ncomputational psychotherapy. We study the task of cognitive distortion\ndetection and propose the Diagnosis of Thought (DoT) prompting. DoT performs\ndiagnosis on the patient's speech via three stages: subjectivity assessment to\nseparate the facts and the thoughts; contrastive reasoning to elicit the\nreasoning processes supporting and contradicting the thoughts; and schema\nanalysis to summarize the cognition schemas. The generated diagnosis rationales\nthrough the three stages are essential for assisting the professionals.\nExperiments demonstrate that DoT obtains significant improvements over ChatGPT\nfor cognitive distortion detection, while generating high-quality rationales\napproved by human experts.\n","authors":["Zhiyu Chen","Yujie Lu","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07146v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.05280v2","updated":"2023-10-11T02:32:10Z","published":"2023-10-08T21:03:18Z","title":"Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona\n Biases in Dialogue Systems","summary":" Recent advancements in Large Language Models empower them to follow freeform\ninstructions, including imitating generic or specific demographic personas in\nconversations. Generic personas refer to an individual from a demographic group\n(e.g. an Asian person), whereas specific personas can be actual names of\nhistorical figures. While the adoption of personas allows dialogue systems to\nbe more engaging and approachable to users, it also carries the potential risk\nof exacerbating social biases in model responses, further causing societal\nharms through interactions with users. In this paper, we systematically study\n\"persona biases\", which we define to be the sensitivity of harmful dialogue\nmodel behaviors to different persona adoptions. We categorize persona biases\ninto biases in harmful expression and harmful agreement, as well as establish a\ncomprehensive evaluation framework to measure persona biases in five aspects:\nOffensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic\nAgreement. 
Additionally, we propose to comprehensively investigate persona\nbiases through experimenting with UniversalPersona, a systematized persona\ndataset with a comprehensive list of both generic and specific model personas.\nThrough benchmarking on four different models, including Blender, ChatGPT,\nAlpaca, and Vicuna, our study uncovers significant persona biases in these\ndialogue systems. Findings of our study underscore the immediate need to\nrevisit the use of persona traits in dialogue agents to ensure their safe\napplication.\n","authors":["Yixin Wan","Jieyu Zhao","Aman Chadha","Nanyun Peng","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2310.05280v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07137v1","updated":"2023-10-11T02:22:28Z","published":"2023-10-11T02:22:28Z","title":"AE-smnsMLC: Multi-Label Classification with Semantic Matching and\n Negative Label Sampling for Product Attribute Value Extraction","summary":" Product attribute value extraction plays an important role for many\nreal-world applications in e-Commerce such as product search and\nrecommendation. Previous methods treat it as a sequence labeling task that\nneeds more annotation for position of values in the product text. This limits\ntheir application to real-world scenarios in which only attribute values are\nweakly-annotated for each product without their position. Moreover, these\nmethods only use product text (i.e., product title and description) and do not\nconsider the semantic connection between the multiple attribute values of a\ngiven product and its text, which can help attribute value extraction. In this\npaper, we reformulate this task as a multi-label classification task that can\nbe applied in real-world scenarios in which only annotation of attribute values\nis available to train models (i.e., annotation of positional information of\nattribute values is not available). We propose a classification model with\nsemantic matching and negative label sampling for attribute value extraction.\nSemantic matching aims to capture semantic interactions between attribute\nvalues of a given product and its text. Negative label sampling aims to enhance\nthe model's ability to distinguish similar values belonging to the same\nattribute. Experimental results on three subsets of a large real-world\ne-Commerce dataset demonstrate the effectiveness and superiority of our\nproposed model.\n","authors":["Zhongfen Deng","Wei-Te Chen","Lei Chen","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2310.07137v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07135v1","updated":"2023-10-11T02:16:12Z","published":"2023-10-11T02:16:12Z","title":"Comparing Styles across Languages","summary":" Understanding how styles differ across languages is advantageous for training\nboth humans and computers to generate culturally appropriate text. We introduce\nan explanation framework to extract stylistic differences from multilingual LMs\nand compare styles across languages. Our framework (1) generates comprehensive\nstyle lexica in any language and (2) consolidates feature importances from LMs\ninto comparable lexical categories. We apply this framework to compare\npoliteness, creating the first holistic multilingual politeness dataset and\nexploring how politeness varies across four languages. 
Our approach enables an\neffective evaluation of how distinct linguistic categories contribute to\nstylistic variations and provides interpretable insights into how people\ncommunicate differently around the world.\n","authors":["Shreya Havaldar","Matthew Pressimone","Eric Wong","Lyle Ungar"],"pdf_url":"https://arxiv.org/pdf/2310.07135v1.pdf","comment":"To appear in EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07106v1","updated":"2023-10-11T01:03:42Z","published":"2023-10-11T01:03:42Z","title":"The Temporal Structure of Language Processing in the Human Brain\n Corresponds to The Layered Hierarchy of Deep Language Models","summary":" Deep Language Models (DLMs) provide a novel computational paradigm for\nunderstanding the mechanisms of natural language processing in the human brain.\nUnlike traditional psycholinguistic models, DLMs use layered sequences of\ncontinuous numerical vectors to represent words and context, allowing a\nplethora of emerging applications such as human-like text generation. In this\npaper we show evidence that the layered hierarchy of DLMs may be used to model\nthe temporal dynamics of language comprehension in the brain by demonstrating a\nstrong correlation between DLM layer depth and the time at which layers are\nmost predictive of the human brain. Our ability to temporally resolve\nindividual layers benefits from our use of electrocorticography (ECoG) data,\nwhich has a much higher temporal resolution than noninvasive methods like fMRI.\nUsing ECoG, we record neural activity from participants listening to a\n30-minute narrative while also feeding the same narrative to a high-performing\nDLM (GPT2-XL). We then extract contextual embeddings from the different layers\nof the DLM and use linear encoding models to predict neural activity. We first\nfocus on the Inferior Frontal Gyrus (IFG, or Broca's area) and then extend our\nmodel to track the increasing temporal receptive window along the linguistic\nprocessing hierarchy from auditory to syntactic and semantic areas. Our results\nreveal a connection between human language processing and DLMs, with the DLM's\nlayer-by-layer accumulation of contextual information mirroring the timing of\nneural activity in high-order language areas.\n","authors":["Ariel Goldstein","Eric Ham","Mariano Schain","Samuel Nastase","Zaid Zada","Avigail Dabush","Bobbi Aubrey","Harshvardhan Gazula","Amir Feder","Werner K Doyle","Sasha Devore","Patricia Dugan","Daniel Friedman","Roi Reichart","Michael Brenner","Avinatan Hassidim","Orrin Devinsky","Adeen Flinker","Omer Levy","Uri Hasson"],"pdf_url":"https://arxiv.org/pdf/2310.07106v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07096v1","updated":"2023-10-11T00:38:57Z","published":"2023-10-11T00:38:57Z","title":"Sparse Universal Transformer","summary":" The Universal Transformer (UT) is a variant of the Transformer that shares\nparameters across its layers. Empirical evidence shows that UTs have better\ncompositional generalization than Vanilla Transformers (VTs) in formal language\ntasks. The parameter-sharing also affords it better parameter efficiency than\nVTs. Despite its many advantages, scaling UT parameters is much more compute\nand memory intensive than scaling up a VT. This paper proposes the Sparse\nUniversal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE)\nand a new stick-breaking-based dynamic halting mechanism to reduce UT's\ncomputation complexity while retaining its parameter efficiency and\ngeneralization ability. 
Experiments show that SUT achieves the same performance\nas strong baseline models while only using half computation and parameters on\nWMT'14 and strong generalization results on formal language tasks (Logical\ninference and CFQ). The new halting mechanism also enables around 50\\%\nreduction in computation during inference with very little performance decrease\non formal language tasks.\n","authors":["Shawn Tan","Yikang Shen","Zhenfang Chen","Aaron Courville","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2310.07096v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09442v3","updated":"2023-10-11T00:37:33Z","published":"2023-06-15T18:49:50Z","title":"Explore, Establish, Exploit: Red Teaming Language Models from Scratch","summary":" Deploying large language models (LMs) can pose hazards from harmful outputs\nsuch as toxic or false text. Prior work has introduced automated tools that\nelicit harmful outputs to identify these risks. While this is a valuable step\ntoward securing models, these approaches rely on a pre-existing way to\nefficiently classify undesirable outputs. Using a pre-existing classifier does\nnot allow for red-teaming to be tailored to the target model. Furthermore, when\nfailures can be easily classified in advance, red-teaming has limited marginal\nvalue because problems can be avoided by simply filtering training data and/or\nmodel outputs. Here, we consider red-teaming \"from scratch,\" in which the\nadversary does not begin with a way to classify failures. Our framework\nconsists of three steps: 1) Exploring the model's range of behaviors in the\ndesired context; 2) Establishing a definition and measurement for undesired\nbehavior (e.g., a classifier trained to reflect human evaluations); and 3)\nExploiting the model's flaws using this measure to develop diverse adversarial\nprompts. We use this approach to red-team GPT-3 to discover classes of inputs\nthat elicit false statements. In doing so, we construct the CommonClaim dataset\nof 20,000 statements labeled by humans as common-knowledge-true, common\nknowledge-false, or neither. We are making code and data available.\n","authors":["Stephen Casper","Jason Lin","Joe Kwon","Gatlen Culp","Dylan Hadfield-Menell"],"pdf_url":"https://arxiv.org/pdf/2306.09442v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07093v1","updated":"2023-10-11T00:18:29Z","published":"2023-10-11T00:18:29Z","title":"Argumentative Stance Prediction: An Exploratory Study on Multimodality\n and Few-Shot Learning","summary":" To advance argumentative stance prediction as a multimodal problem, the First\nShared Task in Multimodal Argument Mining hosted stance prediction in crucial\nsocial topics of gun control and abortion. Our exploratory study attempts to\nevaluate the necessity of images for stance prediction in tweets and compare\nout-of-the-box text-based large-language models (LLM) in few-shot settings\nagainst fine-tuned unimodal and multimodal models. Our work suggests an\nensemble of fine-tuned text-based language models (0.817 F1-score) outperforms\nboth the multimodal (0.677 F1-score) and text-based few-shot prediction using a\nrecent state-of-the-art LLM (0.550 F1-score). 
In addition to the differences in\nperformance, our findings suggest that the multimodal models tend to perform\nbetter when image content is summarized as natural language over their native\npixel structure, and that using in-context examples improves few-shot\nperformance of LLMs.\n","authors":["Arushi Sharma","Abhibha Gupta","Maneesh Bilalpur"],"pdf_url":"https://arxiv.org/pdf/2310.07093v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07091v1","updated":"2023-10-11T00:14:40Z","published":"2023-10-11T00:14:40Z","title":"Jaeger: A Concatenation-Based Multi-Transformer VQA Model","summary":" Document-based Visual Question Answering poses a challenging task between\nlinguistic sense disambiguation and fine-grained multimodal retrieval. Although\nthere has been encouraging progress in document-based question answering due to\nthe utilization of large language and open-world prior models\cite{1}, several\nchallenges persist, including prolonged response times, extended inference\ndurations, and imprecision in matching. In order to overcome these challenges,\nwe propose Jaeger, a concatenation-based multi-transformer VQA model. To derive\nquestion features, we leverage the exceptional capabilities of RoBERTa\nlarge\cite{2} and GPT2-xl\cite{3} as feature extractors. Subsequently, we\nsubject the outputs from both models to a concatenation process. This operation\nallows the model to consider information from diverse sources concurrently,\nstrengthening its representational capability. By leveraging pre-trained models\nfor feature extraction, our approach has the potential to amplify the\nperformance of these models through concatenation. After concatenation, we\napply dimensionality reduction to the output features, reducing the model's\ncomputational cost and inference time. Empirical results demonstrate\nthat our proposed model achieves competitive performance on Task C of the\nPDF-VQA Dataset.\n","authors":["Jieting Long","Zewei Shi","Penghao Jiang","Yidong Gan"],"pdf_url":"https://arxiv.org/pdf/2310.07091v1.pdf","comment":"This paper is the technical research paper of CIKM 2023 DocIU\n challenges. The authors received the CIKM 2023 DocIU Winner Award, sponsored\n by Google, Microsoft, and the Centre for data-driven geoscience"},{"id":"http://arxiv.org/abs/2310.07088v1","updated":"2023-10-11T00:01:41Z","published":"2023-10-11T00:01:41Z","title":"Diversity of Thought Improves Reasoning Abilities of Large Language\n Models","summary":" Large language models (LLMs) are documented to struggle in settings that\nrequire complex reasoning. Nevertheless, instructing the model to break down\nthe problem into smaller reasoning steps (Wei et al., 2022), or ensembling\nvarious generations through modifying decoding steps (Wang et al., 2023) boosts\nperformance. Current methods assume that the input prompt is fixed and expect\nthe decoding strategies to introduce the diversity needed for ensembling. In\nthis work, we relax this assumption and discuss how one can create and leverage\nvariations of the input prompt as a means to diversity of thought to improve\nmodel performance. We propose a method that automatically improves prompt\ndiversity by soliciting feedback from the LLM to ideate approaches that fit\nthe problem. We then ensemble the diverse prompts in our method DIV-SE (DIVerse\nreasoning path Self-Ensemble) across multiple inference calls. 
We also propose\na cost-effective alternative where diverse prompts are used within a single\ninference call; we call this IDIV-SE (In-call DIVerse reasoning path\nSelf-Ensemble). Under a fixed generation budget, DIV-SE and IDIV-SE outperform\nthe previously discussed baselines using both GPT-3.5 and GPT-4 on several\nreasoning benchmarks, without modifying the decoding process. Additionally,\nDIV-SE advances state-of-the-art performance on recent planning benchmarks\n(Valmeekam et al., 2023), exceeding the highest previously reported accuracy by\nat least 29.6 percentage points on the most challenging 4/5 Blocksworld task.\nOur results shed light on how to enforce prompt diversity toward LLM reasoning\nand thereby improve the pareto frontier of the accuracy-cost trade-off.\n","authors":["Ranjita Naik","Varun Chandrasekaran","Mert Yuksekgonul","Hamid Palangi","Besmira Nushi"],"pdf_url":"https://arxiv.org/pdf/2310.07088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10966v5","updated":"2023-10-11T23:59:52Z","published":"2023-09-19T23:39:07Z","title":"MBR and QE Finetuning: Training-time Distillation of the Best and Most\n Expensive Decoding Methods","summary":" Recent research in decoding methods for Natural Language Generation (NLG)\ntasks has shown that MAP decoding is not optimal, because model probabilities\ndo not always align with human preferences. Stronger decoding methods,\nincluding Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR)\ndecoding, have since been proposed to mitigate the model-perplexity-vs-quality\nmismatch. While these decoding methods achieve state-of-the-art performance,\nthey are prohibitively expensive to compute. In this work, we propose MBR\nfinetuning and QE finetuning which distill the quality gains from these\ndecoding methods at training time, while using an efficient decoding algorithm\nat inference time. Using the canonical NLG task of Neural Machine Translation\n(NMT), we show that even with self-training, these finetuning methods\nsignificantly outperform the base model. Moreover, when using an external LLM\nas a teacher model, these finetuning methods outperform finetuning on\nhuman-generated references. These findings suggest new ways to leverage\nmonolingual data to achieve improvements in model quality that are on par with,\nor even exceed, improvements from human-curated data, while maintaining maximum\nefficiency during decoding.\n","authors":["Mara Finkelstein","Subhajit Naskar","Mehdi Mirzazadeh","Apurva Shah","Markus Freitag"],"pdf_url":"https://arxiv.org/pdf/2309.10966v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03135v3","updated":"2023-10-11T23:38:36Z","published":"2023-07-06T17:05:26Z","title":"Distilling Large Vision-Language Model with Out-of-Distribution\n Generalizability","summary":" Large vision-language models have achieved outstanding performance, but their\nsize and computational requirements make their deployment on\nresource-constrained devices and time-sensitive tasks impractical. Model\ndistillation, the process of creating smaller, faster models that maintain the\nperformance of larger models, is a promising direction towards the solution.\nThis paper investigates the distillation of visual representations in large\nteacher vision-language models into lightweight student models using a small-\nor mid-scale dataset. Notably, this study focuses on open-vocabulary\nout-of-distribution (OOD) generalization, a challenging problem that has been\noverlooked in previous model distillation literature. 
We propose two principles\nfrom vision and language modality perspectives to enhance student's OOD\ngeneralization: (1) by better imitating teacher's visual representation space,\nand carefully promoting better coherence in vision-language alignment with the\nteacher; (2) by enriching the teacher's language representations with\ninformative and finegrained semantic attributes to effectively distinguish\nbetween different labels. We propose several metrics and conduct extensive\nexperiments to investigate their techniques. The results demonstrate\nsignificant improvements in zero-shot and few-shot student performance on\nopen-vocabulary out-of-distribution classification, highlighting the\neffectiveness of our proposed approaches. Poster:\nhttps://xuanlinli17.github.io/pdfs/iccv23_large_vlm_distillation_poster.pdf\nCode: https://github.com/xuanlinli17/large_vlm_distillation_ood\n","authors":["Xuanlin Li","Yunhao Fang","Minghua Liu","Zhan Ling","Zhuowen Tu","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2307.03135v3.pdf","comment":"Published at International Conference on Computer Vision (ICCV) 2023.\n Poster at\n https://xuanlinli17.github.io/pdfs/iccv23_large_vlm_distillation_poster.pdf"},{"id":"http://arxiv.org/abs/2310.07931v1","updated":"2023-10-11T23:01:29Z","published":"2023-10-11T23:01:29Z","title":"D2 Pruning: Message Passing for Balancing Diversity and Difficulty in\n Data Pruning","summary":" Analytical theories suggest that higher-quality data can lead to lower test\nerrors in models trained on a fixed data budget. Moreover, a model can be\ntrained on a lower compute budget without compromising performance if a dataset\ncan be stripped of its redundancies. Coreset selection (or data pruning) seeks\nto select a subset of the training data so as to maximize the performance of\nmodels trained on this subset, also referred to as coreset. There are two\ndominant approaches: (1) geometry-based data selection for maximizing data\ndiversity in the coreset, and (2) functions that assign difficulty scores to\nsamples based on training dynamics. Optimizing for data diversity leads to a\ncoreset that is biased towards easier samples, whereas, selection by difficulty\nranking omits easy samples that are necessary for the training of deep learning\nmodels. This demonstrates that data diversity and importance scores are two\ncomplementary factors that need to be jointly considered during coreset\nselection. We represent a dataset as an undirected graph and propose a novel\npruning algorithm, D2 Pruning, that uses forward and reverse message passing\nover this dataset graph for coreset selection. D2 Pruning updates the\ndifficulty scores of each example by incorporating the difficulty of its\nneighboring examples in the dataset graph. Then, these updated difficulty\nscores direct a graph-based sampling method to select a coreset that\nencapsulates both diverse and difficult regions of the dataset space. We\nevaluate supervised and self-supervised versions of our method on various\nvision and language datasets. 
Results show that D2 Pruning improves coreset\nselection over previous state-of-the-art methods for up to 70% pruning rates.\nAdditionally, we find that using D2 Pruning for filtering large multimodal\ndatasets leads to increased diversity in the dataset and improved\ngeneralization of pretrained models.\n","authors":["Adyasha Maharana","Prateek Yadav","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2310.07931v1.pdf","comment":"17 pages (Our code is available at\n https://github.com/adymaharana/d2pruning)"},{"id":"http://arxiv.org/abs/2310.07929v1","updated":"2023-10-11T22:57:03Z","published":"2023-10-11T22:57:03Z","title":"Crosslingual Structural Priming and the Pre-Training Dynamics of\n Bilingual Language Models","summary":" Do multilingual language models share abstract grammatical representations\nacross languages, and if so, when do these develop? Following Sinclair et al.\n(2022), we use structural priming to test for abstract grammatical\nrepresentations with causal effects on model outputs. We extend the approach to\na Dutch-English bilingual setting, and we evaluate a Dutch-English language\nmodel during pre-training. We find that crosslingual structural priming effects\nemerge early after exposure to the second language, with less than 1M tokens of\ndata in that language. We discuss implications for data contamination,\nlow-resource transfer, and how abstract grammatical representations emerge in\nmultilingual models.\n","authors":["Catherine Arnett","Tyler A. Chang","James A. Michaelov","Benjamin K. Bergen"],"pdf_url":"https://arxiv.org/pdf/2310.07929v1.pdf","comment":"Extended abstract accepted to the 3rd Multilingual Representation\n Learning workshop at EMNLP 2023"},{"id":"http://arxiv.org/abs/2308.03853v2","updated":"2023-10-11T22:41:37Z","published":"2023-08-07T18:03:10Z","title":"Exploring zero-shot capability of large language models in inferences\n from medical oncology notes","summary":" Both medical care and observational studies in oncology require a thorough\nunderstanding of a patient's disease progression and treatment history, often\nelaborately documented in clinical notes. Despite their vital role, no current\noncology information representation and annotation schema fully encapsulates\nthe diversity of information recorded within these notes. Although large\nlanguage models (LLMs) have recently exhibited impressive performance on\nvarious medical natural language processing tasks, due to the current lack of\ncomprehensively annotated oncology datasets, an extensive evaluation of LLMs in\nextracting and reasoning with the complex rhetoric in oncology notes remains\nunderstudied. We developed a detailed schema for annotating textual oncology\ninformation, encompassing patient characteristics, tumor characteristics,\ntests, treatments, and temporality. Using a corpus of 40 de-identified breast\nand pancreatic cancer progress notes at University of California, San\nFrancisco, we applied this schema to assess the abilities of three\nrecently-released LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to perform\nzero-shot extraction of detailed oncological history from two narrative\nsections of clinical progress notes. Our team annotated 9028 entities, 9986\nmodifiers, and 5312 relationships. The GPT-4 model exhibited overall best\nperformance, with an average BLEU score of 0.68, an average ROUGE score of\n0.71, and an average accuracy of 67% on complex tasks (expert manual evaluation\non subset). 
Notably, it was proficient in tumor characteristic and medication\nextraction, and demonstrated superior performance in advanced tasks of\ninferring symptoms due to cancer and considerations of future medications.\nGPT-4 may already be usable to extract important facts from cancer progress\nnotes needed for clinical research, complex population management, and\ndocumenting quality patient care.\n","authors":["Madhumita Sushil","Vanessa E. Kennedy","Divneet Mandair","Brenda Y. Miao","Travis Zack","Atul J. Butte"],"pdf_url":"https://arxiv.org/pdf/2308.03853v2.pdf","comment":"Source code available at:\n https://github.com/MadhumitaSushil/OncLLMExtraction"},{"id":"http://arxiv.org/abs/2310.07923v1","updated":"2023-10-11T22:35:18Z","published":"2023-10-11T22:35:18Z","title":"The Expresssive Power of Transformers with Chain of Thought","summary":" Recent theoretical work has identified surprisingly simple reasoning\nproblems, such as checking if two nodes in a graph are connected or simulating\nfinite-state machines, that are provably unsolvable by standard transformers\nthat answer immediately after reading their input. However, in practice,\ntransformers' reasoning can be improved by allowing them to use a \"chain of\nthought\" or \"scratchpad\", i.e., generate and condition on a sequence of\nintermediate tokens before answering. Motivated by this, we ask: Does such\nintermediate generation fundamentally extend the computational power of a\ndecoder-only transformer? We show that the answer is yes, but the amount of\nincrease depends crucially on the amount of intermediate generation. For\ninstance, we find that transformer decoders with a logarithmic number of\ndecoding steps (w.r.t. the input length) push the limits of standard\ntransformers only slightly, while a linear number of decoding steps adds a\nclear new ability (under standard complexity conjectures): recognizing all\nregular languages. Our results also imply that linear steps keep transformer\ndecoders within context-sensitive languages, and polynomial steps make them\nrecognize exactly the class of polynomial-time solvable problems -- the first\nexact characterization of a type of transformers in terms of standard\ncomplexity classes. Together, our results provide a nuanced framework for\nunderstanding how the length of a transformer's chain of thought or scratchpad\nimpacts its reasoning power.\n","authors":["William Merrill","Ashish Sabharwal"],"pdf_url":"https://arxiv.org/pdf/2310.07923v1.pdf","comment":"9-page preprint"},{"id":"http://arxiv.org/abs/2310.07911v1","updated":"2023-10-11T21:38:40Z","published":"2023-10-11T21:38:40Z","title":"Pit One Against Many: Leveraging Attention-head Embeddings for\n Parameter-efficient Multi-head Attention","summary":" Scaling pre-trained language models has resulted in large performance gains\nin various natural language processing tasks but comes with a large cost in\nmemory requirements. Inspired by the position embeddings in transformers, we\naim to simplify and reduce the memory footprint of the multi-head attention\n(MHA) mechanism. We propose an alternative module that uses only a single\nshared projection matrix and multiple head embeddings (MHE), i.e. one per head.\nWe empirically demonstrate that our MHE attention is substantially more memory\nefficient compared to alternative attention mechanisms while achieving high\npredictive performance retention ratio to vanilla MHA on several downstream\ntasks. 
MHE attention only requires a negligible fraction of additional\nparameters ($3nd$, where $n$ is the number of attention heads and $d$ the size\nof the head embeddings) compared to a single-head attention, while MHA requires\n$(3n^2-3n)d^2-3nd$ additional parameters.\n","authors":["Huiyin Xue","Nikolaos Aletras"],"pdf_url":"https://arxiv.org/pdf/2310.07911v1.pdf","comment":"Accepted at EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2305.09656v3","updated":"2023-10-11T21:38:22Z","published":"2023-05-16T17:55:51Z","title":"SatLM: Satisfiability-Aided Language Models Using Declarative Prompting","summary":" Prior work has combined chain-of-thought prompting in large language models\n(LLMs) with programmatic representations to perform effective and transparent\nreasoning. While such an approach works well for tasks that only require\nforward reasoning (e.g., straightforward arithmetic), it is less effective for\nconstraint solving problems that require more sophisticated planning and\nsearch. In this paper, we propose a new satisfiability-aided language modeling\n(SatLM) approach for improving the reasoning capabilities of LLMs. We use an\nLLM to generate a declarative task specification rather than an imperative\nprogram and leverage an off-the-shelf automated theorem prover to derive the\nfinal answer. This approach has two key advantages. The declarative\nspecification is closer to the problem description than the reasoning steps\nare, so the LLM can parse it out of the description more accurately.\nFurthermore, by offloading the actual reasoning task to an automated theorem\nprover, our approach can guarantee the correctness of the answer with respect\nto the parsed specification and avoid planning errors in the solving process.\nWe evaluate SATLM on 8 different datasets and show that it consistently\noutperforms program-aided LMs in the imperative paradigm. In particular, SATLM\noutperforms program-aided LMs by 23% on a challenging subset of the GSM\narithmetic reasoning dataset; SATLM also achieves a new SoTA on LSAT and\nBoardgameQA, surpassing previous models that are trained on the respective\ntraining sets.\n","authors":["Xi Ye","Qiaochu Chen","Isil Dillig","Greg Durrett"],"pdf_url":"https://arxiv.org/pdf/2305.09656v3.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.07889v1","updated":"2023-10-11T20:52:30Z","published":"2023-10-11T20:52:30Z","title":"LangNav: Language as a Perceptual Representation for Navigation","summary":" We explore the use of language as a perceptual representation for\nvision-and-language navigation. Our approach uses off-the-shelf vision systems\n(for image captioning and object detection) to convert an agent's egocentric\npanoramic view at each time step into natural language descriptions. We then\nfinetune a pretrained language model to select an action, based on the current\nview and the trajectory history, that would best fulfill the navigation\ninstructions. In contrast to the standard setup which adapts a pretrained\nlanguage model to work directly with continuous visual features from pretrained\nvision models, our approach instead uses (discrete) language as the perceptual\nrepresentation. 
We explore two use cases of our language-based navigation\n(LangNav) approach on the R2R vision-and-language navigation benchmark:\ngenerating synthetic trajectories from a prompted large language model (GPT-4)\nwith which to finetune a smaller language model; and sim-to-real transfer where\nwe transfer a policy learned on a simulated environment (ALFRED) to a\nreal-world environment (R2R). Our approach is found to improve upon strong\nbaselines that rely on visual features in settings where only a few gold\ntrajectories (10-100) are available, demonstrating the potential of using\nlanguage as a perceptual representation for navigation tasks.\n","authors":["Bowen Pan","Rameswar Panda","SouYoung Jin","Rogerio Feris","Aude Oliva","Phillip Isola","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2310.07889v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07875v1","updated":"2023-10-11T20:34:42Z","published":"2023-10-11T20:34:42Z","title":"TabLib: A Dataset of 627M Tables with Context","summary":" It is well-established that large, diverse datasets play a pivotal role in\nthe performance of modern AI systems for text and image modalities. However,\nthere are no datasets for tabular data of comparable size and diversity to\nthose available for text and images. Thus we present \"TabLib'', a compilation\nof 627 million tables totaling 69 TiB, along with 867B tokens of context.\nTabLib was extracted from numerous file formats, including CSV, HTML, SQLite,\nPDF, Excel, and others, sourced from GitHub and Common Crawl. The size and\ndiversity of TabLib offer considerable promise in the table modality,\nreminiscent of the original promise of foundational datasets for text and\nimages, such as The Pile and LAION.\n","authors":["Gus Eggert","Kevin Huo","Mike Biven","Justin Waugh"],"pdf_url":"https://arxiv.org/pdf/2310.07875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07856v1","updated":"2023-10-11T19:58:07Z","published":"2023-10-11T19:58:07Z","title":"Assessing Evaluation Metrics for Neural Test Oracle Generation","summary":" In this work, we revisit existing oracle generation studies plus ChatGPT to\nempirically investigate the current standing of their performance in both\nNLG-based and test adequacy metrics. Specifically, we train and run four\nstate-of-the-art test oracle generation models on five NLG-based and two test\nadequacy metrics for our analysis. We apply two different correlation analyses\nbetween these two different sets of metrics. Surprisingly, we found no\nsignificant correlation between the NLG-based metrics and test adequacy\nmetrics. For instance, oracles generated from ChatGPT on the project\nactivemq-artemis had the highest performance on all the NLG-based metrics among\nthe studied NOGs, however, it had the most number of projects with a decrease\nin test adequacy metrics compared to all the studied NOGs. We further conduct a\nqualitative analysis to explore the reasons behind our observations, we found\nthat oracles with high NLG-based metrics but low test adequacy metrics tend to\nhave complex or multiple chained method invocations within the oracle's\nparameters, making it hard for the model to generate completely, affecting the\ntest adequacy metrics. 
On the other hand, oracles with low NLG-based metrics\nbut high test adequacy metrics tend to call different assertion types\nor a different method that functions similarly to the ones in the ground truth.\nOverall, this work complements prior studies on test oracle generation with an\nextensive performance evaluation with both NLG and test adequacy metrics and\nprovides guidelines for better assessment of deep learning applications in\nsoftware test generation in the future.\n","authors":["Jiho Shin","Hadi Hemmati","Moshi Wei","Song Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07856v1.pdf","comment":"10 pages + reference"},{"id":"http://arxiv.org/abs/2305.09548v2","updated":"2023-10-11T19:57:43Z","published":"2023-05-16T15:45:59Z","title":"Measuring Social Dimensions of Self-Presentation in Social Media\n Biographies with an Identity-based Approach","summary":" Social media users on sites like Twitter, Instagram, and Tiktok use the\nprofile description, or bio, field of user profiles to present themselves to\nthe world. In contrast to the ``offline'' world, where social context often\nencourages us to adopt a single identity, the profile description is a\nfree-text field in which users are encouraged to present the self using\nmultiple, sometimes conflicting, social identities. While sociologists, social\npsychologists, sociolinguists, and increasingly computational social\nscientists, have developed a large and growing array of methods to estimate the\nmeaning of individual social identities, little work has attended to the ways\nin which social meanings emerge from the collections of social identities\npresent in social media bios. The present work proposes and evaluates three\nnovel, identity-based methods to measure the social dimensions of meaning\nexpressed in Twitter bios. We show that these models outperform reasonable\nbaselines with respect to 1) predicting which sets of identities are more\nlikely to co-occur within a single biography and 2) quantifying perceptions of\nentire social media biographies along salient dimensions of social meaning on\nTwitter, in particular partisanship. We demonstrate the utility of our method\nin a computational social science setting by using model outputs to better\nunderstand how self-presentation along dimensions of partisanship, religion,\nage, and gender is related to the sharing of URLs on Twitter from low versus\nhigh quality news sites.\n","authors":["Navid Madani","Rabiraj Bandyopadhyay","Briony Swire-Thompson","Michael Miller Yoder","Kenneth Joseph"],"pdf_url":"https://arxiv.org/pdf/2305.09548v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07849v1","updated":"2023-10-11T19:51:13Z","published":"2023-10-11T19:51:13Z","title":"Synthetic Data Generation with Large Language Models for Text\n Classification: Potential and Limitations","summary":" The collection and curation of high-quality training data is crucial for\ndeveloping text classification models with superior performance, but it is\noften associated with significant costs and time investment. Researchers have\nrecently explored using large language models (LLMs) to generate synthetic\ndatasets as an alternative approach. However, the effectiveness of the\nLLM-generated synthetic data in supporting model training is inconsistent\nacross different classification tasks. 
To better understand factors that\nmoderate the effectiveness of the LLM-generated synthetic data, in this study,\nwe look into how the performance of models trained on these synthetic data may\nvary with the subjectivity of classification. Our results indicate that\nsubjectivity, at both the task level and instance level, is negatively\nassociated with the performance of the model trained on synthetic data. We\nconclude by discussing the implications of our work on the potential and\nlimitations of leveraging LLM for synthetic data generation.\n","authors":["Zhuoyan Li","Hangxiao Zhu","Zhuoran Lu","Ming Yin"],"pdf_url":"https://arxiv.org/pdf/2310.07849v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07848v1","updated":"2023-10-11T19:50:59Z","published":"2023-10-11T19:50:59Z","title":"Framework for Question-Answering in Sanskrit through Automated\n Construction of Knowledge Graphs","summary":" Sanskrit (sa\\d{m}sk\\d{r}ta) enjoys one of the largest and most varied\nliterature in the whole world. Extracting the knowledge from it, however, is a\nchallenging task due to multiple reasons including complexity of the language\nand paucity of standard natural language processing tools. In this paper, we\ntarget the problem of building knowledge graphs for particular types of\nrelationships from sa\\d{m}sk\\d{r}ta texts. We build a natural language\nquestion-answering system in sa\\d{m}sk\\d{r}ta that uses the knowledge graph to\nanswer factoid questions. We design a framework for the overall system and\nimplement two separate instances of the system on human relationships from\nmah\\=abh\\=arata and r\\=am\\=aya\\d{n}a, and one instance on synonymous\nrelationships from bh\\=avaprak\\=a\\'sa nigha\\d{n}\\d{t}u, a technical text from\n\\=ayurveda. We show that about 50% of the factoid questions can be answered\ncorrectly by the system. More importantly, we analyse the shortcomings of the\nsystem in detail for each step, and discuss the possible ways forward.\n","authors":["Hrishikesh Terdalkar","Arnab Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2310.07848v1.pdf","comment":"Accepted at 6th International Sanskrit Computational Linguistics\n Symposium (ISCLS) 2019"},{"id":"http://arxiv.org/abs/2310.07830v1","updated":"2023-10-11T19:16:09Z","published":"2023-10-11T19:16:09Z","title":"Does Synthetic Data Make Large Language Models More Efficient?","summary":" Natural Language Processing (NLP) has undergone transformative changes with\nthe advent of deep learning methodologies. One challenge persistently\nconfronting researchers is the scarcity of high-quality, annotated datasets\nthat drive these models. This paper explores the nuances of synthetic data\ngeneration in NLP, with a focal point on template-based question generation. By\nassessing its advantages, including data augmentation potential and the\nintroduction of structured variety, we juxtapose these benefits against\ninherent limitations, such as the risk of overfitting and the constraints posed\nby pre-defined templates. Drawing from empirical evaluations, we demonstrate\nthe impact of template-based synthetic data on the performance of modern\ntransformer models. We conclude by emphasizing the delicate balance required\nbetween synthetic and real-world data, and the future trajectories of\nintegrating synthetic data in model training pipelines. 
The findings aim to\nguide NLP practitioners in harnessing synthetic data's potential, ensuring\noptimal model performance in diverse applications.\n","authors":["Sia Gholami","Marwan Omar"],"pdf_url":"https://arxiv.org/pdf/2310.07830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07826v1","updated":"2023-10-11T19:09:07Z","published":"2023-10-11T19:09:07Z","title":"Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language\n Annotation","summary":" One of the primary obstacles in the advancement of Natural Language\nProcessing (NLP) technologies for low-resource languages is the lack of\nannotated datasets for training and testing machine learning models. In this\npaper, we present Antarlekhaka, a tool for manual annotation of a comprehensive\nset of tasks relevant to NLP. The tool is Unicode-compatible,\nlanguage-agnostic, Web-deployable and supports distributed annotation by\nmultiple simultaneous annotators. The system sports user-friendly interfaces\nfor 8 categories of annotation tasks. These, in turn, enable the annotation of\na considerably larger set of NLP tasks. The task categories include two\nlinguistic tasks not handled by any other tool, namely, sentence boundary\ndetection and deciding canonical word order, which are important tasks for text\nthat is in the form of poetry. We propose the idea of sequential annotation\nbased on small text units, where an annotator performs several tasks related to\na single text unit before proceeding to the next unit. The research\napplications of the proposed mode of multi-task annotation are also discussed.\nAntarlekhaka outperforms other annotation tools in objective evaluation. It has\nbeen also used for two real-life annotation tasks on two different languages,\nnamely, Sanskrit and Bengali. The tool is available at\nhttps://github.com/Antarlekhaka/code.\n","authors":["Hrishikesh Terdalkar","Arnab Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2310.07826v1.pdf","comment":"Accepted: 3rd Workshop for Natural Language Processing Open Source\n Software (NLP-OSS) @ EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07821v1","updated":"2023-10-11T19:02:57Z","published":"2023-10-11T19:02:57Z","title":"Non-autoregressive Text Editing with Copy-aware Latent Alignments","summary":" Recent work has witnessed a paradigm shift from Seq2Seq to Seq2Edit in the\nfield of text editing, with the aim of addressing the slow autoregressive\ninference problem posed by the former. Despite promising results, Seq2Edit\napproaches still face several challenges such as inflexibility in generation\nand difficulty in generalizing to other languages. In this work, we propose a\nnovel non-autoregressive text editing method to circumvent the above issues, by\nmodeling the edit process with latent CTC alignments. We make a crucial\nextension to CTC by introducing the copy operation into the edit space, thus\nenabling more efficient management of textual overlap in editing. We conduct\nextensive experiments on GEC and sentence fusion tasks, showing that our\nproposed method significantly outperforms existing Seq2Edit models and achieves\nsimilar or even better results than Seq2Seq with over $4\\times$ speedup.\nMoreover, it demonstrates good generalizability on German and Russian. 
In-depth\nanalyses reveal the strengths of our method in terms of the robustness under\nvarious scenarios and generating fluent and flexible outputs.\n","authors":["Yu Zhang","Yue Zhang","Leyang Cui","Guohong Fu"],"pdf_url":"https://arxiv.org/pdf/2310.07821v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07819v1","updated":"2023-10-11T19:00:40Z","published":"2023-10-11T19:00:40Z","title":"Faithfulness Measurable Masked Language Models","summary":" A common approach to explain NLP models, is to use importance measures that\nexpress which tokens are important for a prediction. Unfortunately, such\nexplanations are often wrong despite being persuasive. Therefore, it is\nessential to measure their faithfulness. One such metric is if tokens are truly\nimportant, then masking them should result in worse model performance. However,\ntoken masking introduces out-of-distribution issues and existing solutions are\ncomputationally expensive and employ proxy-models. Furthermore, other metrics\nare very limited in scope. In this work, we propose an inherently faithfulness\nmeasurable model that addresses these challenges. This is achieved by using a\nnovel fine-tuning method that incorporates masking, such that masking tokens\nbecome in-distribution by design. This differs from existing approaches, which\nare completely model-agnostic but are inapplicable in practice. We demonstrate\nthe generality of our approach by applying it to various tasks and validate it\nusing statistical in-distribution tests. Additionally, because masking is\nin-distribution, importance measures which themselves use masking become more\nfaithful, thus our model becomes more explainable.\n","authors":["Andreas Madsen","Siva Reddy","Sarath Chandar"],"pdf_url":"https://arxiv.org/pdf/2310.07819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07818v1","updated":"2023-10-11T18:59:48Z","published":"2023-10-11T18:59:48Z","title":"Exploring the Relationship between Analogy Identification and Sentence\n Structure Encoding in Large Language Models","summary":" Identifying analogies plays a pivotal role in human cognition and language\nproficiency. In the last decade, there has been extensive research on word\nanalogies in the form of ``A is to B as C is to D.'' However, there is a\ngrowing interest in analogies that involve longer text, such as sentences and\ncollections of sentences, which convey analogous meanings. While the current\nNLP research community evaluates the ability of Large Language Models (LLMs) to\nidentify such analogies, the underlying reasons behind these abilities warrant\ndeeper investigation. Furthermore, the capability of LLMs to encode both\nsyntactic and semantic structures of language within their embeddings has\ngarnered significant attention with the surge in their utilization. In this\nwork, we examine the relationship between the abilities of multiple LLMs to\nidentify sentence analogies, and their capacity to encode syntactic and\nsemantic structures. Through our analysis, we find that analogy identification\nability of LLMs is positively correlated with their ability to encode syntactic\nand semantic structures of sentences. 
Specifically, we find that the LLMs which\ncapture syntactic structures better, also have higher abilities in identifying\nsentence analogies.\n","authors":["Thilini Wijesiriwardene","Ruwan Wickramarachchi","Aishwarya Naresh Reganti","Vinija Jain","Aman Chadha","Amit Sheth","Amitava Das"],"pdf_url":"https://arxiv.org/pdf/2310.07818v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07815v1","updated":"2023-10-11T18:56:15Z","published":"2023-10-11T18:56:15Z","title":"Language Models As Semantic Indexers","summary":" Semantic identifier (ID) is an important concept in information retrieval\nthat aims to preserve the semantics of objects such as documents and items\ninside their IDs. Previous studies typically adopt a two-stage pipeline to\nlearn semantic IDs by first procuring embeddings using off-the-shelf text\nencoders and then deriving IDs based on the embeddings. However, each step\nintroduces potential information loss and there is usually an inherent mismatch\nbetween the distribution of embeddings within the latent space produced by text\nencoders and the anticipated distribution required for semantic indexing.\nNevertheless, it is non-trivial to design a method that can learn the\ndocument's semantic representations and its hierarchical structure\nsimultaneously, given that semantic IDs are discrete and sequentially\nstructured, and the semantic supervision is deficient. In this paper, we\nintroduce LMINDEXER, a self-supervised framework to learn semantic IDs with a\ngenerative language model. We tackle the challenge of sequential discrete ID by\nintroducing a semantic indexer capable of generating neural sequential discrete\nrepresentations with progressive training and contrastive learning. In response\nto the semantic supervision deficiency, we propose to train the model with a\nself-supervised document reconstruction objective. The learned semantic indexer\ncan facilitate various downstream tasks, such as recommendation and retrieval.\nWe conduct experiments on three tasks including recommendation, product search,\nand document retrieval on five datasets from various domains, where LMINDEXER\noutperforms competitive baselines significantly and consistently.\n","authors":["Bowen Jin","Hansi Zeng","Guoyin Wang","Xiusi Chen","Tianxin Wei","Ruirui Li","Zhengyang Wang","Zheng Li","Yang Li","Hanqing Lu","Suhang Wang","Jiawei Han","Xianfeng Tang"],"pdf_url":"https://arxiv.org/pdf/2310.07815v1.pdf","comment":"9 pages, 3 appendix pages"},{"id":"http://arxiv.org/abs/2310.07803v1","updated":"2023-10-11T18:36:13Z","published":"2023-10-11T18:36:13Z","title":"A general mechanism of humor: reformulating the semantic overlap","summary":" This article proposes a cognitive mechanism of humour of general\napplicability, not restricted to verbal communication. It is indebted to\nRaskin's concept of script overlap, and conforms to the incongruity-resolution\ntheoretical framework, but it is built on the notion of constraint, an abstract\ncorrespondence between sets of data. Under this view, script overlap is an\noutcome of a more abstractly described phenomenon, constraint overlap. The\nimportant concept of the overlooked argument is introduced to characterise the\ntwo overlapping constraints -- overt and covert. Their inputs and outputs are\nnot directly encoded in utterances, but implicated by them, and their overlap\nresults in another overlap at the level of the communicated utterances, that\nthe incongruity reveals. 
Our hypothesis assumes as a given that the evocation\nof such constraints is a cognitive effect of the inferential process by which a\nhearer interprets utterances. We base this assumption on Hofstadter's theory of\nanalogy-making as the essence of human thought. By substituting \"stimuli\" of\nany kind for \"utterances\" in this model, we obtain a mechanism as easily\napplicable to non-verbal communication -- slapstick, cartoons -- and we propose\nit describes the necessary and sufficient conditions for a communicative act in\nany modality to carry humour.\n","authors":["Javier Martínez"],"pdf_url":"https://arxiv.org/pdf/2310.07803v1.pdf","comment":"24 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.07795v1","updated":"2023-10-11T18:30:37Z","published":"2023-10-11T18:30:37Z","title":"Ontology Enrichment for Effective Fine-grained Entity Typing","summary":" Fine-grained entity typing (FET) is the task of identifying specific entity\ntypes at a fine-grained level for entity mentions based on their contextual\ninformation. Conventional methods for FET require extensive human annotation,\nwhich is time-consuming and costly. Recent studies have been developing weakly\nsupervised or zero-shot approaches. We study the setting of zero-shot FET where\nonly an ontology is provided. However, most existing ontology structures lack\nrich supporting information and even contain ambiguous relations, making them\nineffective in guiding FET. Recently developed language models, though\npromising in various few-shot and zero-shot NLP tasks, may face challenges in\nzero-shot FET due to their lack of interaction with task-specific ontology. In\nthis study, we propose OnEFET, where we (1) enrich each node in the ontology\nstructure with two types of extra information: instance information for\ntraining sample augmentation and topic information to relate types to contexts,\nand (2) develop a coarse-to-fine typing algorithm that exploits the enriched\ninformation by training an entailment model with contrasting topics and\ninstance-based augmented training samples. Our experiments show that OnEFET\nachieves high-quality fine-grained entity typing without human annotation,\noutperforming existing zero-shot methods by a large margin and rivaling\nsupervised methods.\n","authors":["Siru Ouyang","Jiaxin Huang","Pranav Pillai","Yunyi Zhang","Yu Zhang","Jiawei Han"],"pdf_url":"https://arxiv.org/pdf/2310.07795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07793v1","updated":"2023-10-11T18:27:12Z","published":"2023-10-11T18:27:12Z","title":"GenTKG: Generative Forecasting on Temporal Knowledge Graph","summary":" The rapid advancements in large language models (LLMs) have ignited interest\nin the temporal knowledge graph (tKG) domain, where conventional carefully\ndesigned embedding-based and rule-based models dominate. The question remains\nopen of whether pre-trained LLMs can understand structured temporal relational\ndata and replace them as the foundation model for temporal relational\nforecasting. Therefore, we bring temporal knowledge forecasting into the\ngenerative setting. However, challenges occur in the huge chasms between\ncomplex temporal graph data structure and sequential natural expressions LLMs\ncan handle, and between the enormous data sizes of tKGs and heavy computation\ncosts of finetuning LLMs. 
To address these challenges, we propose a novel\nretrieval augmented generation framework that performs generative forecasting\non tKGs named GenTKG, which combines a temporal logical rule-based retrieval\nstrategy and lightweight parameter-efficient instruction tuning. Extensive\nexperiments have shown that GenTKG outperforms conventional methods of temporal\nrelational forecasting under low computation resources. GenTKG also highlights\nremarkable transferability with exceeding performance on unseen datasets\nwithout re-training. Our work reveals the huge potential of LLMs in the tKG\ndomain and opens a new frontier for generative forecasting on tKGs.\n","authors":["Ruotong Liao","Xu Jia","Yunpu Ma","Volker Tresp"],"pdf_url":"https://arxiv.org/pdf/2310.07793v1.pdf","comment":"8 pages"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.07716v1","updated":"2023-10-11T17:59:56Z","published":"2023-10-11T17:59:56Z","title":"PAD: A Dataset and Benchmark for Pose-agnostic Anomaly Detection","summary":" Object anomaly detection is an important problem in the field of machine\nvision and has seen remarkable progress recently. However, two significant\nchallenges hinder its research and application. First, existing datasets lack\ncomprehensive visual information from various pose angles. They usually have an\nunrealistic assumption that the anomaly-free training dataset is pose-aligned,\nand the testing samples have the same pose as the training data. However, in\npractice, anomaly may exist in any regions on a object, the training and query\nsamples may have different poses, calling for the study on pose-agnostic\nanomaly detection. Second, the absence of a consensus on experimental protocols\nfor pose-agnostic anomaly detection leads to unfair comparisons of different\nmethods, hindering the research on pose-agnostic anomaly detection. To address\nthese issues, we develop Multi-pose Anomaly Detection (MAD) dataset and\nPose-agnostic Anomaly Detection (PAD) benchmark, which takes the first step to\naddress the pose-agnostic anomaly detection problem. Specifically, we build MAD\nusing 20 complex-shaped LEGO toys including 4K views with various poses, and\nhigh-quality and diverse 3D anomalies in both simulated and real environments.\nAdditionally, we propose a novel method OmniposeAD, trained using MAD,\nspecifically designed for pose-agnostic anomaly detection. Through\ncomprehensive evaluations, we demonstrate the relevance of our dataset and\nmethod. Furthermore, we provide an open-source benchmark library, including\ndataset and baseline methods that cover 8 anomaly detection paradigms, to\nfacilitate future research and application in this domain. Code, data, and\nmodels are publicly available at https://github.com/EricLee0224/PAD.\n","authors":["Qiang Zhou","Weize Li","Lihan Jiang","Guoliang Wang","Guyue Zhou","Shanghang Zhang","Hao Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.07716v1.pdf","comment":"Accepted by NeurIPS 2023. Codes are available at\n https://github.com/EricLee0224/PAD"},{"id":"http://arxiv.org/abs/2305.05658v2","updated":"2023-10-11T17:59:44Z","published":"2023-05-09T17:52:59Z","title":"TidyBot: Personalized Robot Assistance with Large Language Models","summary":" For a robot to personalize physical assistance effectively, it must learn\nuser preferences that can be generally reapplied to future scenarios. 
In this\nwork, we investigate personalization of household cleanup with robots that can\ntidy up rooms by picking up objects and putting them away. A key challenge is\ndetermining the proper place to put each object, as people's preferences can\nvary greatly depending on personal taste or cultural background. For instance,\none person may prefer storing shirts in the drawer, while another may prefer\nthem on the shelf. We aim to build systems that can learn such preferences from\njust a handful of examples via prior interactions with a particular person. We\nshow that robots can combine language-based planning and perception with the\nfew-shot summarization capabilities of large language models (LLMs) to infer\ngeneralized user preferences that are broadly applicable to future\ninteractions. This approach enables fast adaptation and achieves 91.2% accuracy\non unseen objects in our benchmark dataset. We also demonstrate our approach on\na real-world mobile manipulator called TidyBot, which successfully puts away\n85.0% of objects in real-world test scenarios.\n","authors":["Jimmy Wu","Rika Antonova","Adam Kan","Marion Lepert","Andy Zeng","Shuran Song","Jeannette Bohg","Szymon Rusinkiewicz","Thomas Funkhouser"],"pdf_url":"https://arxiv.org/pdf/2305.05658v2.pdf","comment":"Accepted to Autonomous Robots (AuRo) - Special Issue: Large Language\n Models in Robotics, 2023 and IEEE/RSJ International Conference on Intelligent\n Robots and Systems (IROS), 2023. Project page:\n https://tidybot.cs.princeton.edu"},{"id":"http://arxiv.org/abs/2310.07707v1","updated":"2023-10-11T17:57:14Z","published":"2023-10-11T17:57:14Z","title":"MatFormer: Nested Transformer for Elastic Inference","summary":" Transformer models are deployed in a wide range of settings, from\nmulti-accelerator clusters to standalone mobile phones. The diverse inference\nconstraints in these scenarios necessitate practitioners to train foundation\nmodels such as PaLM 2, Llama, & ViTs as a series of models of varying sizes.\nDue to significant training costs, only a select few model sizes are trained\nand supported, limiting more fine-grained control over relevant tradeoffs,\nincluding latency, cost, and accuracy. This work introduces MatFormer, a nested\nTransformer architecture designed to offer elasticity in a variety of\ndeployment constraints. Each Feed Forward Network (FFN) block of a MatFormer\nmodel is jointly optimized with a few nested smaller FFN blocks. This training\nprocedure allows for the Mix'n'Match of model granularities across layers --\ni.e., a trained universal MatFormer model enables extraction of hundreds of\naccurate smaller models, which were never explicitly optimized. We empirically\ndemonstrate MatFormer's effectiveness across different model classes (decoders\n& encoders), modalities (language & vision), and scales (up to 2.6B\nparameters). We find that a 2.6B decoder-only MatFormer language model (MatLM)\nallows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting\ncomparable validation loss and one-shot downstream evaluations to their\nindependently trained counterparts. 
Furthermore, we observe that smaller\nencoders extracted from a universal MatFormer-based ViT (MatViT) encoder\npreserve the metric-space structure for adaptive large-scale retrieval.\nFinally, we showcase that speculative decoding with the accurate and consistent\nsubmodels extracted from MatFormer can further reduce inference latency.\n","authors":[" Devvrit","Sneha Kudugunta","Aditya Kusupati","Tim Dettmers","Kaifeng Chen","Inderjit Dhillon","Yulia Tsvetkov","Hannaneh Hajishirzi","Sham Kakade","Ali Farhadi","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2310.07707v1.pdf","comment":"31 pages, 12 figures, first three authors contributed equally"},{"id":"http://arxiv.org/abs/2310.07704v1","updated":"2023-10-11T17:55:15Z","published":"2023-10-11T17:55:15Z","title":"Ferret: Refer and Ground Anything Anywhere at Any Granularity","summary":" We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of\nunderstanding spatial referring of any shape or granularity within an image and\naccurately grounding open-vocabulary descriptions. To unify referring and\ngrounding in the LLM paradigm, Ferret employs a novel and powerful hybrid\nregion representation that integrates discrete coordinates and continuous\nfeatures jointly to represent a region in the image. To extract the continuous\nfeatures of versatile regions, we propose a spatial-aware visual sampler, adept\nat handling varying sparsity across different shapes. Consequently, Ferret can\naccept diverse region inputs, such as points, bounding boxes, and free-form\nshapes. To bolster the desired capability of Ferret, we curate GRIT, a\ncomprehensive refer-and-ground instruction tuning dataset including 1.1M\nsamples that contain rich hierarchical spatial knowledge, with 95K hard\nnegative data to promote model robustness. The resulting model not only\nachieves superior performance in classical referring and grounding tasks, but\nalso greatly outperforms existing MLLMs in region-based and\nlocalization-demanded multimodal chatting. Our evaluations also reveal a\nsignificantly improved capability of describing image details and a remarkable\nalleviation in object hallucination. Code and data will be available at\nhttps://github.com/apple/ml-ferret\n","authors":["Haoxuan You","Haotian Zhang","Zhe Gan","Xianzhi Du","Bowen Zhang","Zirui Wang","Liangliang Cao","Shih-Fu Chang","Yinfei Yang"],"pdf_url":"https://arxiv.org/pdf/2310.07704v1.pdf","comment":"30 pages, 10 figures. Code/Project Website:\n https://github.com/apple/ml-ferret"},{"id":"http://arxiv.org/abs/2310.07702v1","updated":"2023-10-11T17:52:39Z","published":"2023-10-11T17:52:39Z","title":"ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with\n Diffusion Models","summary":" In this work, we investigate the capability of generating images from\npre-trained diffusion models at much higher resolutions than the training image\nsizes. In addition, the generated images should have arbitrary image aspect\nratios. When generating images directly at a higher resolution, 1024 x 1024,\nwith the pre-trained Stable Diffusion using training images of resolution 512 x\n512, we observe persistent problems of object repetition and unreasonable\nobject structures. Existing works for higher-resolution generation, such as\nattention-based and joint-diffusion approaches, cannot well address these\nissues. 
As a new perspective, we examine the structural components of the U-Net\nin diffusion models and identify the crucial cause as the limited perception\nfield of convolutional kernels. Based on this key observation, we propose a\nsimple yet effective re-dilation that can dynamically adjust the convolutional\nperception field during inference. We further propose the dispersed convolution\nand noise-damped classifier-free guidance, which can enable\nultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our\napproach does not require any training or optimization. Extensive experiments\ndemonstrate that our approach can address the repetition issue well and achieve\nstate-of-the-art performance on higher-resolution image synthesis, especially\nin texture details. Our work also suggests that a pre-trained diffusion model\ntrained on low-resolution images can be directly used for high-resolution\nvisual generation without further tuning, which may provide insights for future\nresearch on ultra-high-resolution image and video synthesis.\n","authors":["Yingqing He","Shaoshu Yang","Haoxin Chen","Xiaodong Cun","Menghan Xia","Yong Zhang","Xintao Wang","Ran He","Qifeng Chen","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2310.07702v1.pdf","comment":"Project page: https://yingqinghe.github.io/scalecrafter/ Github:\n https://github.com/YingqingHe/ScaleCrafter"},{"id":"http://arxiv.org/abs/2310.07699v1","updated":"2023-10-11T17:49:13Z","published":"2023-10-11T17:49:13Z","title":"From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched\n Captions","summary":" Web-crawled datasets are pivotal to the success of pre-training\nvision-language models, exemplified by CLIP. However, web-crawled AltTexts can\nbe noisy and potentially irrelevant to images, thereby undermining the crucial\nimage-text alignment. Existing methods for rewriting captions using large\nlanguage models (LLMs) have shown promise on small, curated datasets like CC3M\nand CC12M. Nevertheless, their efficacy on massive web-captured captions is\nconstrained by the inherent noise and randomness in such data. In this study,\nwe address this limitation by focusing on two key aspects: data quality and\ndata variety. Unlike recent LLM rewriting techniques, we emphasize exploiting\nvisual concepts and their integration into the captions to improve data\nquality. For data variety, we propose a novel mixed training scheme that\noptimally leverages AltTexts alongside newly generated Visual-enriched Captions\n(VeC). We use CLIP as one example and adapt the method for CLIP training on\nlarge-scale web-crawled datasets, named VeCLIP. We conduct a comprehensive\nevaluation of VeCLIP across small, medium, and large scales of raw data. Our\nresults show significant advantages in image-text alignment and overall model\nperformance, underscoring the effectiveness of VeCLIP in improving CLIP\ntraining. For example, VeCLIP achieves a remarkable over 20% improvement in\nCOCO and Flickr30k retrieval tasks under the 12M setting. 
For data efficiency,\nwe also achieve a notable over 3% improvement while using only 14% of the data\nemployed in the vanilla CLIP and 11% in ALIGN.\n","authors":["Zhengfeng Lai","Haotian Zhang","Wentao Wu","Haoping Bai","Aleksei Timofeev","Xianzhi Du","Zhe Gan","Jiulong Shan","Chen-Nee Chuah","Yinfei Yang","Meng Cao"],"pdf_url":"https://arxiv.org/pdf/2310.07699v1.pdf","comment":"CV/ML"},{"id":"http://arxiv.org/abs/2310.07697v1","updated":"2023-10-11T17:46:28Z","published":"2023-10-11T17:46:28Z","title":"ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation","summary":" Recent works have successfully extended large-scale text-to-image models to\nthe video domain, producing promising results but at a high computational cost\nand requiring a large amount of video data. In this work, we introduce\nConditionVideo, a training-free approach to text-to-video generation based on\nthe provided condition, video, and input text, by leveraging the power of\noff-the-shelf text-to-image generation methods (e.g., Stable Diffusion).\nConditionVideo generates realistic dynamic videos from random noise or given\nscene videos. Our method explicitly disentangles the motion representation into\ncondition-guided and scenery motion components. To this end, the ConditionVideo\nmodel is designed with a UNet branch and a control branch. To improve temporal\ncoherence, we introduce sparse bi-directional spatial-temporal attention\n(sBiST-Attn). The 3D control network extends the conventional 2D controlnet\nmodel, aiming to strengthen conditional generation accuracy by additionally\nleveraging the bi-directional frames in the temporal domain. Our method\nexhibits superior performance in terms of frame consistency, clip score, and\nconditional accuracy, outperforming other compared methods.\n","authors":["Bo Peng","Xinyuan Chen","Yaohui Wang","Chaochao Lu","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2310.07697v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09535v3","updated":"2023-10-11T17:46:21Z","published":"2023-03-16T17:51:13Z","title":"FateZero: Fusing Attentions for Zero-shot Text-based Video Editing","summary":" The diffusion-based generative models have achieved remarkable success in\ntext-based image generation. However, since it contains enormous randomness in\ngeneration progress, it is still challenging to apply such models for\nreal-world visual content editing, especially in videos. In this paper, we\npropose FateZero, a zero-shot text-based editing method on real-world videos\nwithout per-prompt training or use-specific mask. To edit videos consistently,\nwe propose several techniques based on the pre-trained models. Firstly, in\ncontrast to the straightforward DDIM inversion technique, our approach captures\nintermediate attention maps during inversion, which effectively retain both\nstructural and motion information. These maps are directly fused in the editing\nprocess rather than generated during denoising. To further minimize semantic\nleakage of the source video, we then fuse self-attentions with a blending mask\nobtained by cross-attention features from the source prompt. Furthermore, we\nhave implemented a reform of the self-attention mechanism in denoising UNet by\nintroducing spatial-temporal attention to ensure frame consistency. Yet\nsuccinct, our method is the first one to show the ability of zero-shot\ntext-driven video style and local attribute editing from the trained\ntext-to-image model. 
We also have a better zero-shot shape-aware editing\nability based on the text-to-video model. Extensive experiments demonstrate our\nsuperior temporal consistency and editing capability than previous works.\n","authors":["Chenyang Qi","Xiaodong Cun","Yong Zhang","Chenyang Lei","Xintao Wang","Ying Shan","Qifeng Chen"],"pdf_url":"https://arxiv.org/pdf/2303.09535v3.pdf","comment":"Accepted to ICCV 2023 as an Oral Presentation. Project page:\n https://fate-zero-edit.github.io ; GitHub repository:\n https://github.com/ChenyangQiQi/FateZero"},{"id":"http://arxiv.org/abs/2303.05078v2","updated":"2023-10-11T17:46:03Z","published":"2023-03-09T07:26:49Z","title":"Efficient Transformer-based 3D Object Detection with Dynamic Token\n Halting","summary":" Balancing efficiency and accuracy is a long-standing problem for deploying\ndeep learning models. The trade-off is even more important for real-time\nsafety-critical systems like autonomous vehicles. In this paper, we propose an\neffective approach for accelerating transformer-based 3D object detectors by\ndynamically halting tokens at different layers depending on their contribution\nto the detection task. Although halting a token is a non-differentiable\noperation, our method allows for differentiable end-to-end learning by\nleveraging an equivalent differentiable forward-pass. Furthermore, our\nframework allows halted tokens to be reused to inform the model's predictions\nthrough a straightforward token recycling mechanism. Our method significantly\nimproves the Pareto frontier of efficiency versus accuracy when compared with\nthe existing approaches. By halting tokens and increasing model capacity, we\nare able to improve the baseline model's performance without increasing the\nmodel's latency on the Waymo Open Dataset.\n","authors":["Mao Ye","Gregory P. Meyer","Yuning Chai","Qiang Liu"],"pdf_url":"https://arxiv.org/pdf/2303.05078v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07687v1","updated":"2023-10-11T17:36:17Z","published":"2023-10-11T17:36:17Z","title":"Orbital Polarimetric Tomography of a Flare Near the Sagittarius A*\n Supermassive Black Hole","summary":" The interaction between the supermassive black hole at the center of the\nMilky Way, Sagittarius A$^*$, and its accretion disk, occasionally produces\nhigh energy flares seen in X-ray, infrared and radio. One mechanism for\nobserved flares is the formation of compact bright regions that appear within\nthe accretion disk and close to the event horizon. Understanding these flares\ncan provide a window into black hole accretion processes. Although\nsophisticated simulations predict the formation of these flares, their\nstructure has yet to be recovered by observations. Here we show the first\nthree-dimensional (3D) reconstruction of an emission flare in orbit recovered\nfrom ALMA light curves observed on April 11, 2017. Our recovery results show\ncompact bright regions at a distance of roughly 6 times the event horizon.\nMoreover, our recovery suggests a clockwise rotation in a low-inclination\norbital plane, a result consistent with prior studies by EHT and GRAVITY\ncollaborations. To recover this emission structure we solve a highly ill-posed\ntomography problem by integrating a neural 3D representation (an emergent\nartificial intelligence approach for 3D reconstruction) with a gravitational\nmodel for black holes. 
Although the recovered 3D structure is subject, and\nsometimes sensitive, to the model assumptions, under physically motivated\nchoices we find that our results are stable and our approach is successful on\nsimulated data. We anticipate that in the future, this approach could be used\nto analyze a richer collection of time-series data that could shed light on the\nmechanisms governing black hole and plasma dynamics.\n","authors":["Aviad Levis","Andrew A. Chael","Katherine L. Bouman","Maciek Wielgus","Pratul P. Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2310.07687v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12424v2","updated":"2023-10-11T17:34:19Z","published":"2023-06-21T17:59:51Z","title":"VisoGender: A dataset for benchmarking gender bias in image-text pronoun\n resolution","summary":" We introduce VisoGender, a novel dataset for benchmarking gender bias in\nvision-language models. We focus on occupation-related biases within a\nhegemonic system of binary gender, inspired by Winograd and Winogender schemas,\nwhere each image is associated with a caption containing a pronoun relationship\nof subjects and objects in the scene. VisoGender is balanced by gender\nrepresentation in professional roles, supporting bias evaluation in two ways:\ni) resolution bias, where we evaluate the difference between pronoun resolution\naccuracies for image subjects with gender presentations perceived as masculine\nversus feminine by human annotators and ii) retrieval bias, where we compare\nratios of professionals perceived to have masculine and feminine gender\npresentations retrieved for a gender-neutral search query. We benchmark several\nstate-of-the-art vision-language models and find that they demonstrate bias in\nresolving binary gender in complex scenes. While the direction and magnitude of\ngender bias depends on the task and the model being evaluated, captioning\nmodels are generally less biased than Vision-Language Encoders. Dataset and\ncode are available at https://github.com/oxai/visogender\n","authors":["Siobhan Mackenzie Hall","Fernanda Gonçalves Abrantes","Hanwen Zhu","Grace Sodunke","Aleksandar Shtedritski","Hannah Rose Kirk"],"pdf_url":"https://arxiv.org/pdf/2306.12424v2.pdf","comment":"Data and code available at https://github.com/oxai/visogender"},{"id":"http://arxiv.org/abs/2310.07682v1","updated":"2023-10-11T17:32:24Z","published":"2023-10-11T17:32:24Z","title":"Prediction of MET Overexpression in Non-Small Cell Lung Adenocarcinomas\n from Hematoxylin and Eosin Images","summary":" MET protein overexpression is a targetable event in non-small cell lung\ncancer (NSCLC) and is the subject of active drug development. Challenges in\nidentifying patients for these therapies include lack of access to validated\ntesting, such as standardized immunohistochemistry (IHC) assessment, and\nconsumption of valuable tissue for a single gene/protein assay. Development of\npre-screening algorithms using routinely available digitized hematoxylin and\neosin (H&E)-stained slides to predict MET overexpression could promote testing\nfor those who will benefit most. While assessment of MET expression using IHC\nis currently not routinely performed in NSCLC, next-generation sequencing is\ncommon and in some cases includes RNA expression panel testing. In this work,\nwe leveraged a large database of matched H&E slides and RNA expression data to\ntrain a weakly supervised model to predict MET RNA overexpression directly from\nH&E images. 
This model was evaluated on an independent holdout test set of 300\nover-expressed and 289 normal patients, demonstrating an ROC-AUC of 0.70 (95th\npercentile interval: 0.66 - 0.74) with stable performance characteristics\nacross different patient clinical variables and robust to synthetic noise on\nthe test set. These results suggest that H&E-based predictive models could be\nuseful to prioritize patients for confirmatory testing of MET protein or MET\ngene expression status.\n","authors":["Kshitij Ingale","Sun Hae Hong","Josh S. K. Bell","Abbas Rizvi","Amy Welch","Lingdao Sha","Irvin Ho","Kunal Nagpal","Aicha BenTaieb","Rohan P Joshi","Martin C Stumpe"],"pdf_url":"https://arxiv.org/pdf/2310.07682v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07678v1","updated":"2023-10-11T17:21:48Z","published":"2023-10-11T17:21:48Z","title":"Explainable Image Similarity: Integrating Siamese Networks and Grad-CAM","summary":" With the proliferation of image-based applications in various domains, the\nneed for accurate and interpretable image similarity measures has become\nincreasingly critical. Existing image similarity models often lack\ntransparency, making it challenging to understand the reasons why two images\nare considered similar. In this paper, we propose the concept of explainable\nimage similarity, where the goal is the development of an approach, which is\ncapable of providing similarity scores along with visual factual and\ncounterfactual explanations. Along this line, we present a new framework, which\nintegrates Siamese Networks and Grad-CAM for providing explainable image\nsimilarity and discuss the potential benefits and challenges of adopting this\napproach. In addition, we provide a comprehensive discussion about factual and\ncounterfactual explanations provided by the proposed framework for assisting\ndecision making. The proposed approach has the potential to enhance the\ninterpretability, trustworthiness and user acceptance of image-based systems in\nreal-world image similarity applications. The implementation code can be found\nin https://github.com/ioannislivieris/Grad_CAM_Siamese.git.\n","authors":["Ioannis E. Livieris","Emmanuel Pintelas","Niki Kiriakidou","Panagiotis Pintelas"],"pdf_url":"https://arxiv.org/pdf/2310.07678v1.pdf","comment":"The manuscript has been submitted for publication in \"Journal of\n Imaging\""},{"id":"http://arxiv.org/abs/2310.07669v1","updated":"2023-10-11T17:18:15Z","published":"2023-10-11T17:18:15Z","title":"HaarNet: Large-scale Linear-Morphological Hybrid Network for RGB-D\n Semantic Segmentation","summary":" Signals from different modalities each have their own combination algebra\nwhich affects their sampling processing. RGB is mostly linear; depth is a\ngeometric signal following the operations of mathematical morphology. If a\nnetwork obtaining RGB-D input has both kinds of operators available in its\nlayers, it should be able to give effective output with fewer parameters. In\nthis paper, morphological elements in conjunction with more familiar linear\nmodules are used to construct a mixed linear-morphological network called\nHaarNet. This is the first large-scale linear-morphological hybrid, evaluated\non a set of sizeable real-world datasets. In the network, morphological Haar\nsampling is applied to both feature channels in several layers, which splits\nextreme values and high-frequency information such that both can be processed\nto improve both modalities. 
Moreover, morphologically parameterised ReLU is\nused, and morphologically-sound up-sampling is applied to obtain a\nfull-resolution output. Experiments show that HaarNet is competitive with a\nstate-of-the-art CNN, implying that morphological networks are a promising\nresearch direction for geometry-based learning tasks.\n","authors":["Rick Groenendijk","Leo Dorst","Theo Gevers"],"pdf_url":"https://arxiv.org/pdf/2310.07669v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07664v1","updated":"2023-10-11T17:09:19Z","published":"2023-10-11T17:09:19Z","title":"Accelerating Vision Transformers Based on Heterogeneous Attention\n Patterns","summary":" Recently, Vision Transformers (ViTs) have attracted a lot of attention in the\nfield of computer vision. Generally, the powerful representative capacity of\nViTs mainly benefits from the self-attention mechanism, which has a high\ncomputation complexity. To accelerate ViTs, we propose an integrated\ncompression pipeline based on observed heterogeneous attention patterns across\nlayers. On one hand, different images share more similar attention patterns in\nearly layers than later layers, indicating that the dynamic query-by-key\nself-attention matrix may be replaced with a static self-attention matrix in\nearly layers. Then, we propose a dynamic-guided static self-attention (DGSSA)\nmethod where the matrix inherits self-attention information from the replaced\ndynamic self-attention to effectively improve the feature representation\nability of ViTs. On the other hand, the attention maps have more low-rank\npatterns, which reflect token redundancy, in later layers than early layers. In\na view of linear dimension reduction, we further propose a method of global\naggregation pyramid (GLAD) to reduce the number of tokens in later layers of\nViTs, such as Deit. Experimentally, the integrated compression pipeline of\nDGSSA and GLAD can accelerate up to 121% run-time throughput compared with\nDeiT, which surpasses all SOTA approaches.\n","authors":["Deli Yu","Teng Xi","Jianwei Li","Baopu Li","Gang Zhang","Haocheng Feng","Junyu Han","Jingtuo Liu","Errui Ding","Jingdong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07664v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07663v1","updated":"2023-10-11T17:03:21Z","published":"2023-10-11T17:03:21Z","title":"Deep Video Inpainting Guided by Audio-Visual Self-Supervision","summary":" Humans can easily imagine a scene from auditory information based on their\nprior knowledge of audio-visual events. In this paper, we mimic this innate\nhuman ability in deep learning models to improve the quality of video\ninpainting. To implement the prior knowledge, we first train the audio-visual\nnetwork, which learns the correspondence between auditory and visual\ninformation. Then, the audio-visual network is employed as a guider that\nconveys the prior knowledge of audio-visual correspondence to the video\ninpainting network. This prior knowledge is transferred through our proposed\ntwo novel losses: audio-visual attention loss and audio-visual pseudo-class\nconsistency loss. These two losses further improve the performance of the video\ninpainting by encouraging the inpainting result to have a high correspondence\nto its synchronized audio. 
Experimental results demonstrate that our proposed\nmethod can restore a wider domain of video scenes and is particularly effective\nwhen the sounding object in the scene is partially blinded.\n","authors":["Kyuyeon Kim","Junsik Jung","Woo Jae Kim","Sung-Eui Yoon"],"pdf_url":"https://arxiv.org/pdf/2310.07663v1.pdf","comment":"Accepted at ICASSP 2022"},{"id":"http://arxiv.org/abs/2305.13172v2","updated":"2023-10-11T16:51:50Z","published":"2023-05-22T16:00:00Z","title":"Editing Large Language Models: Problems, Methods, and Opportunities","summary":" Despite the ability to train capable LLMs, the methodology for maintaining\ntheir relevancy and rectifying errors remains elusive. To this end, the past\nfew years have witnessed a surge in techniques for editing LLMs, the objective\nof which is to efficiently alter the behavior of LLMs within a specific domain\nwithout negatively impacting performance across other inputs. This paper\nembarks on a deep exploration of the problems, methods, and opportunities\nrelated to model editing for LLMs. In particular, we provide an exhaustive\noverview of the task definition and challenges associated with model editing,\nalong with an in-depth empirical analysis of the most progressive methods\ncurrently at our disposal. We also build a new benchmark dataset to facilitate\na more robust evaluation and pinpoint enduring issues intrinsic to existing\ntechniques. Our objective is to provide valuable insights into the\neffectiveness and feasibility of each editing technique, thereby assisting the\ncommunity in making informed decisions on the selection of the most appropriate\nmethod for a specific task or context. Code and datasets are available at\nhttps://github.com/zjunlp/EasyEdit.\n","authors":["Yunzhi Yao","Peng Wang","Bozhong Tian","Siyuan Cheng","Zhoubo Li","Shumin Deng","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.13172v2.pdf","comment":"EMNLP 2023. Updated with new experiments"},{"id":"http://arxiv.org/abs/2310.07638v1","updated":"2023-10-11T16:33:30Z","published":"2023-10-11T16:33:30Z","title":"Context-Enhanced Detector For Building Detection From Remote Sensing\n Images","summary":" The field of building detection from remote sensing images has made\nsignificant progress, but faces challenges in achieving high-accuracy detection\ndue to the diversity in building appearances and the complexity of vast scenes.\nTo address these challenges, we propose a novel approach called\nContext-Enhanced Detector (CEDet). Our approach utilizes a three-stage cascade\nstructure to enhance the extraction of contextual information and improve\nbuilding detection accuracy. Specifically, we introduce two modules: the\nSemantic Guided Contextual Mining (SGCM) module, which aggregates multi-scale\ncontexts and incorporates an attention mechanism to capture long-range\ninteractions, and the Instance Context Mining Module (ICMM), which captures\ninstance-level relationship context by constructing a spatial relationship\ngraph and aggregating instance features. Additionally, we introduce a semantic\nsegmentation loss based on pseudo-masks to guide contextual information\nextraction. 
Our method achieves state-of-the-art performance on three building\ndetection benchmarks, including CNBuilding-9P, CNBuilding-23P, and SpaceNet.\n","authors":["Ziyue Huang","Mingming Zhang","Qingjie Liu","Wei Wang","Zhe Dong","Yunhong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07638v1.pdf","comment":"12 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.07633v1","updated":"2023-10-11T16:28:24Z","published":"2023-10-11T16:28:24Z","title":"Attention-Map Augmentation for Hypercomplex Breast Cancer Classification","summary":" Breast cancer is the most widespread neoplasm among women and early detection\nof this disease is critical. Deep learning techniques have become of great\ninterest to improve diagnostic performance. Nonetheless, discriminating between\nmalignant and benign masses from whole mammograms remains challenging due to\nthem being almost identical to an untrained eye and the region of interest\n(ROI) occupying a minuscule portion of the entire image. In this paper, we\npropose a framework, parameterized hypercomplex attention maps (PHAM), to\novercome these problems. Specifically, we deploy an augmentation step based on\ncomputing attention maps. Then, the attention maps are used to condition the\nclassification step by constructing a multi-dimensional input comprised of the\noriginal breast cancer image and the corresponding attention map. In this step,\na parameterized hypercomplex neural network (PHNN) is employed to perform\nbreast cancer classification. The framework offers two main advantages. First,\nattention maps provide critical information regarding the ROI and allow the\nneural model to concentrate on it. Second, the hypercomplex architecture has\nthe ability to model local relations between input dimensions thanks to\nhypercomplex algebra rules, thus properly exploiting the information provided\nby the attention map. We demonstrate the efficacy of the proposed framework on\nboth mammography images as well as histopathological ones, surpassing\nattention-based state-of-the-art networks and the real-valued counterpart of\nour method. The code of our work is available at\nhttps://github.com/elelo22/AttentionBCS.\n","authors":["Eleonora Lopez","Filippo Betello","Federico Carmignani","Eleonora Grassucci","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2310.07633v1.pdf","comment":"Submitted to Pattern Recognition Letters"},{"id":"http://arxiv.org/abs/2310.07632v1","updated":"2023-10-11T16:25:45Z","published":"2023-10-11T16:25:45Z","title":"Prompt Backdoors in Visual Prompt Learning","summary":" Fine-tuning large pre-trained computer vision models is infeasible for\nresource-limited users. Visual prompt learning (VPL) has thus emerged to\nprovide an efficient and flexible alternative to model fine-tuning through\nVisual Prompt as a Service (VPPTaaS). Specifically, the VPPTaaS provider\noptimizes a visual prompt given downstream data, and downstream users can use\nthis prompt together with the large pre-trained model for prediction. However,\nthis new learning paradigm may also pose security risks when the VPPTaaS\nprovider instead provides a malicious visual prompt. In this paper, we take the\nfirst step to explore such risks through the lens of backdoor attacks.\nSpecifically, we propose BadVisualPrompt, a simple yet effective backdoor\nattack against VPL. For example, poisoning $5\\%$ CIFAR10 training data leads to\nabove $99\\%$ attack success rates with only negligible model accuracy drop by\n$1.5\\%$. 
In particular, we identify and then address a new technical challenge\nrelated to interactions between the backdoor trigger and visual prompt, which\ndoes not exist in conventional, model-level backdoors. Moreover, we provide\nin-depth analyses of seven backdoor defenses from model, prompt, and input\nlevels. Overall, all these defenses are either ineffective or impractical to\nmitigate our BadVisualPrompt, implying the critical vulnerability of VPL.\n","authors":["Hai Huang","Zhengyu Zhao","Michael Backes","Yun Shen","Yang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07632v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07623v1","updated":"2023-10-11T16:06:14Z","published":"2023-10-11T16:06:14Z","title":"Dual Quaternion Rotational and Translational Equivariance in 3D Rigid\n Motion Modelling","summary":" Objects' rigid motions in 3D space are described by rotations and\ntranslations of a highly-correlated set of points, each with associated $x,y,z$\ncoordinates that real-valued networks consider as separate entities, losing\ninformation. Previous works exploit quaternion algebra and their ability to\nmodel rotations in 3D space. However, these algebras do not properly encode\ntranslations, leading to sub-optimal performance in 3D learning tasks. To\novercome these limitations, we employ a dual quaternion representation of rigid\nmotions in the 3D space that jointly describes rotations and translations of\npoint sets, processing each of the points as a single entity. Our approach is\ntranslation and rotation equivariant, so it does not suffer from shifts in the\ndata and better learns object trajectories, as we validate in the experimental\nevaluations. Models endowed with this formulation outperform previous\napproaches in a human pose forecasting application, attesting to the\neffectiveness of the proposed dual quaternion formulation for rigid motions in\n3D space.\n","authors":["Guilherme Vieira","Eleonora Grassucci","Marcos Eduardo Valle","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2310.07623v1.pdf","comment":"Accepted at IEEE MLSP 2023 (Honorable Mention Top 10% Outstanding\n Paper)"},{"id":"http://arxiv.org/abs/2310.07602v1","updated":"2023-10-11T15:41:52Z","published":"2023-10-11T15:41:52Z","title":"Dual Radar: A Multi-modal Dataset with Dual 4D Radar for Autononous\n Driving","summary":" Radar has stronger adaptability in adverse scenarios for autonomous driving\nenvironmental perception compared to widely adopted cameras and LiDARs.\nCompared with commonly used 3D radars, latest 4D radars have precise vertical\nresolution and higher point cloud density, making it a highly promising sensor\nfor autonomous driving in complex environmental perception. However, due to the\nmuch higher noise than LiDAR, manufacturers choose different filtering\nstrategies, resulting in an inverse ratio between noise level and point cloud\ndensity. 
There is still a lack of comparative analysis on which method is\nbeneficial for deep learning-based perception algorithms in autonomous driving.\nOne of the main reasons is that current datasets only adopt one type of 4D\nradar, making it difficult to compare different 4D radars in the same scene.\nTherefore, in this paper, we introduce a novel large-scale multi-modal dataset\nfeaturing, for the first time, two types of 4D radars captured simultaneously.\nThis dataset enables further research into effective 4D radar perception\nalgorithms.Our dataset consists of 151 consecutive series, most of which last\n20 seconds and contain 10,007 meticulously synchronized and annotated frames.\nMoreover, our dataset captures a variety of challenging driving scenarios,\nincluding many road conditions, weather conditions, nighttime and daytime with\ndifferent lighting intensities and periods. Our dataset annotates consecutive\nframes, which can be applied to 3D object detection and tracking, and also\nsupports the study of multi-modal tasks. We experimentally validate our\ndataset, providing valuable results for studying different types of 4D radars.\nThis dataset is released on https://github.com/adept-thu/Dual-Radar.\n","authors":["Xinyu Zhang","Li Wang","Jian Chen","Cheng Fang","Lei Yang","Ziying Song","Guangqi Yang","Yichen Wang","Xiaofei Zhang","Jun Li"],"pdf_url":"https://arxiv.org/pdf/2310.07602v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07591v1","updated":"2023-10-11T15:33:10Z","published":"2023-10-11T15:33:10Z","title":"PeP: a Point enhanced Painting method for unified point cloud tasks","summary":" Point encoder is of vital importance for point cloud recognition. As the very\nbeginning step of whole model pipeline, adding features from diverse sources\nand providing stronger feature encoding mechanism would provide better input\nfor downstream modules. In our work, we proposed a novel PeP module to tackle\nabove issue. PeP contains two main parts, a refined point painting method and a\nLM-based point encoder. Experiments results on the nuScenes and KITTI datasets\nvalidate the superior performance of our PeP. The advantages leads to strong\nperformance on both semantic segmentation and object detection, in both lidar\nand multi-modal settings. Notably, our PeP module is model agnostic and\nplug-and-play. Our code will be publicly available soon.\n","authors":["Zichao Dong","Hang Ji","Xufeng Huang","Weikun Zhang","Xin Zhan","Junbo Chen"],"pdf_url":"https://arxiv.org/pdf/2310.07591v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07585v1","updated":"2023-10-11T15:21:40Z","published":"2023-10-11T15:21:40Z","title":"A Discrepancy Aware Framework for Robust Anomaly Detection","summary":" Defect detection is a critical research area in artificial intelligence.\nRecently, synthetic data-based self-supervised learning has shown great\npotential on this task. Although many sophisticated synthesizing strategies\nexist, little research has been done to investigate the robustness of models\nwhen faced with different strategies. In this paper, we focus on this issue and\nfind that existing methods are highly sensitive to them. To alleviate this\nissue, we present a Discrepancy Aware Framework (DAF), which demonstrates\nrobust performance consistently with simple and cheap strategies across\ndifferent anomaly detection benchmarks. 
We hypothesize that the high\nsensitivity to synthetic data of existing self-supervised methods arises from\ntheir heavy reliance on the visual appearance of synthetic data during\ndecoding. In contrast, our method leverages an appearance-agnostic cue to guide\nthe decoder in identifying defects, thereby alleviating its reliance on\nsynthetic appearance. To this end, inspired by existing knowledge distillation\nmethods, we employ a teacher-student network, which is trained based on\nsynthesized outliers, to compute the discrepancy map as the cue. Extensive\nexperiments on two challenging datasets prove the robustness of our method.\nUnder the simple synthesis strategies, it outperforms existing methods by a\nlarge margin. Furthermore, it also achieves the state-of-the-art localization\nperformance. Code is available at: https://github.com/caiyuxuan1120/DAF.\n","authors":["Yuxuan Cai","Dingkang Liang","Dongliang Luo","Xinwei He","Xin Yang","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2310.07585v1.pdf","comment":"Accepted by IEEE Transactions on Industrial Informatics. Code is\n available at: https://github.com/caiyuxuan1120/DAF"},{"id":"http://arxiv.org/abs/2310.07584v1","updated":"2023-10-11T15:20:44Z","published":"2023-10-11T15:20:44Z","title":"Centrality of the Fingerprint Core Location","summary":" Fingerprints have long been recognized as a unique and reliable means of\npersonal identification. Central to the analysis and enhancement of\nfingerprints is the concept of the fingerprint core. Although the location of\nthe core is used in many applications, to the best of our knowledge, this study\nis the first to investigate the empirical distribution of the core over a\nlarge, combined dataset of rolled, as well as plain fingerprint recordings. We\nidentify and investigate the extent of incomplete rolling during the rolled\nfingerprint acquisition and investigate the centrality of the core. After\ncorrecting for the incomplete rolling, we find that the core deviates from the\nfingerprint center by 5.7% $\\pm$ 5.2% to 7.6% $\\pm$ 6.9%, depending on the\nfinger. Additionally, we find that the assumption of normal distribution of the\ncore position of plain fingerprint recordings cannot be rejected, but for\nrolled ones it can. Therefore, we use a multi-step process to find the\ndistribution of the rolled fingerprint recordings. The process consists of an\nAnderson-Darling normality test, the Bayesian Information Criterion to reduce\nthe number of possible candidate distributions and finally a Generalized Monte\nCarlo goodness-of-fit procedure to find the best fitting distribution. We find\nthe non-central Fischer distribution best describes the cores' horizontal\npositions. Finally, we investigate the correlation between mean core position\noffset and the NFIQ 2 score and find that the NFIQ 2 prefers rolled fingerprint\nrecordings where the core sits slightly below the fingerprint center.\n","authors":["Laurenz Ruzicka","Bernhard Strobl","Bernhard Kohn","Clemens Heitzinger"],"pdf_url":"https://arxiv.org/pdf/2310.07584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07573v1","updated":"2023-10-11T15:15:05Z","published":"2023-10-11T15:15:05Z","title":"Relational Prior Knowledge Graphs for Detection and Instance\n Segmentation","summary":" Humans have a remarkable ability to perceive and reason about the world\naround them by understanding the relationships between objects. 
In this paper,\nwe investigate the effectiveness of using such relationships for object\ndetection and instance segmentation. To this end, we propose a Relational\nPrior-based Feature Enhancement Model (RP-FEM), a graph transformer that\nenhances object proposal features using relational priors. The proposed\narchitecture operates on top of scene graphs obtained from initial proposals\nand aims to concurrently learn relational context modeling for object detection\nand instance segmentation. Experimental evaluations on COCO show that the\nutilization of scene graphs, augmented with relational priors, offer benefits\nfor object detection and instance segmentation. RP-FEM demonstrates its\ncapacity to suppress improbable class predictions within the image while also\npreventing the model from generating duplicate predictions, leading to\nimprovements over the baseline model on which it is built.\n","authors":["Osman Ülger","Yu Wang","Ysbrand Galama","Sezer Karaoglu","Theo Gevers","Martin R. Oswald"],"pdf_url":"https://arxiv.org/pdf/2310.07573v1.pdf","comment":"Published in ICCV2023 SG2RL Workshop"},{"id":"http://arxiv.org/abs/2310.07572v1","updated":"2023-10-11T15:14:54Z","published":"2023-10-11T15:14:54Z","title":"Impact of Label Types on Training SWIN Models with Overhead Imagery","summary":" Understanding the impact of data set design on model training and performance\ncan help alleviate the costs associated with generating remote sensing and\noverhead labeled data. This work examined the impact of training shifted window\ntransformers using bounding boxes and segmentation labels, where the latter are\nmore expensive to produce. We examined classification tasks by comparing models\ntrained with both target and backgrounds against models trained with only\ntarget pixels, extracted by segmentation labels. For object detection models,\nwe compared performance using either label type when training. We found that\nthe models trained on only target pixels do not show performance improvement\nfor classification tasks, appearing to conflate background pixels in the\nevaluation set with target pixels. For object detection, we found that models\ntrained with either label type showed equivalent performance across testing. We\nfound that bounding boxes appeared to be sufficient for tasks that did not\nrequire more complex labels, such as object segmentation. Continuing work to\ndetermine consistency of this result across data types and model architectures\ncould potentially result in substantial savings in generating remote sensing\ndata sets for deep learning.\n","authors":["Ryan Ford","Kenneth Hutchison","Nicholas Felts","Benjamin Cheng","Jesse Lew","Kyle Jackson"],"pdf_url":"https://arxiv.org/pdf/2310.07572v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.15751v3","updated":"2023-10-11T15:13:02Z","published":"2022-11-28T20:11:37Z","title":"Edge Video Analytics: A Survey on Applications, Systems and Enabling\n Techniques","summary":" Video, as a key driver in the global explosion of digital information, can\ncreate tremendous benefits for human society. Governments and enterprises are\ndeploying innumerable cameras for a variety of applications, e.g., law\nenforcement, emergency management, traffic control, and security surveillance,\nall facilitated by video analytics (VA). This trend is spurred by the rapid\nadvancement of deep learning (DL), which enables more precise models for object\nclassification, detection, and tracking. 
Meanwhile, with the proliferation of\nInternet-connected devices, massive amounts of data are generated daily,\noverwhelming the cloud. Edge computing, an emerging paradigm that moves\nworkloads and services from the network core to the network edge, has been\nwidely recognized as a promising solution. The resulting new intersection, edge\nvideo analytics (EVA), begins to attract widespread attention. Nevertheless,\nonly a few loosely-related surveys exist on this topic. The basic concepts of\nEVA (e.g., definition, architectures) were not fully elucidated due to the\nrapid development of this domain. To fill these gaps, we provide a\ncomprehensive survey of the recent efforts on EVA. In this paper, we first\nreview the fundamentals of edge computing, followed by an overview of VA. EVA\nsystems and their enabling techniques are discussed next. In addition, we\nintroduce prevalent frameworks and datasets to aid future researchers in the\ndevelopment of EVA systems. Finally, we discuss existing challenges and foresee\nfuture research directions. We believe this survey will help readers comprehend\nthe relationship between VA and edge computing, and spark new ideas on EVA.\n","authors":["Renjie Xu","Saiedeh Razavi","Rong Zheng"],"pdf_url":"https://arxiv.org/pdf/2211.15751v3.pdf","comment":"Accepted in IEEE Communications Surveys and Tutorials, 2023"},{"id":"http://arxiv.org/abs/2304.14933v2","updated":"2023-10-11T15:08:51Z","published":"2023-04-28T15:43:21Z","title":"An Empirical Study of Multimodal Model Merging","summary":" Model merging (e.g., via interpolation or task arithmetic) fuses multiple\nmodels trained on different tasks to generate a multi-task solution. The\ntechnique has been proven successful in previous studies, where the models are\ntrained on similar tasks and with the same initialization. In this paper, we\nexpand on this concept to a multimodal setup by merging transformers trained on\ndifferent modalities. Furthermore, we conduct our study for a novel goal where\nwe can merge vision, language, and cross-modal transformers of a\nmodality-specific architecture to create a parameter-efficient\nmodality-agnostic architecture. Through comprehensive experiments, we\nsystematically investigate the key factors impacting model performance after\nmerging, including initialization, merging mechanisms, and model architectures.\nWe also propose two metrics that assess the distance between weights to be\nmerged and can serve as an indicator of the merging outcomes. Our analysis\nleads to an effective training recipe for matching the performance of the\nmodality-agnostic baseline (i.e., pre-trained from scratch) via model merging.\nOur method also outperforms naive merging significantly on various tasks, with\nimprovements of 3% on VQA, 7% on COCO retrieval, 25% on NLVR2, 14% on Flickr30k\nand 3% on ADE20k. Our code is available at https://github.com/ylsung/vl-merging\n","authors":["Yi-Lin Sung","Linjie Li","Kevin Lin","Zhe Gan","Mohit Bansal","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2304.14933v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.07555v1","updated":"2023-10-11T15:00:11Z","published":"2023-10-11T15:00:11Z","title":"Does resistance to Style-Transfer equal Shape Bias? Evaluating Shape\n Bias by Distorted Shape","summary":" Deep learning models are known to exhibit a strong texture bias, while human\ntends to rely heavily on global shape for object recognition. 
The current\nbenchmark for evaluating a model's shape bias is a set of style-transferred\nimages with the assumption that resistance to the attack of style transfer is\nrelated to the development of shape sensitivity in the model. In this work, we\nshow that networks trained with style-transfer images indeed learn to ignore\nstyle, but its shape bias arises primarily from local shapes. We provide a\nDistorted Shape Testbench (DiST) as an alternative measurement of global shape\nsensitivity. Our test includes 2400 original images from ImageNet-1K, each of\nwhich is accompanied by two images with the global shapes of the original image\ndistorted while preserving its texture via the texture synthesis program. We\nfound that (1) models that performed well on the previous shape bias evaluation\ndo not fare well in the proposed DiST; (2) the widely adopted ViT models do not\nshow significant advantages over Convolutional Neural Networks (CNNs) on this\nbenchmark despite that ViTs rank higher on the previous shape bias tests. (3)\ntraining with DiST images bridges the significant gap between human and\nexisting SOTA models' performance while preserving the models' accuracy on\nstandard image classification tasks; training with DiST images and\nstyle-transferred images are complementary, and can be combined to train\nnetwork together to enhance both the global and local shape sensitivity of the\nnetwork. Our code will be host at: https://github.com/leelabcnbc/DiST\n","authors":["Ziqi Wen","Tianqin Li","Tai Sing Lee"],"pdf_url":"https://arxiv.org/pdf/2310.07555v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07552v1","updated":"2023-10-11T14:54:40Z","published":"2023-10-11T14:54:40Z","title":"ProtoHPE: Prototype-guided High-frequency Patch Enhancement for\n Visible-Infrared Person Re-identification","summary":" Visible-infrared person re-identification is challenging due to the large\nmodality gap. To bridge the gap, most studies heavily rely on the correlation\nof visible-infrared holistic person images, which may perform poorly under\nsevere distribution shifts. In contrast, we find that some cross-modal\ncorrelated high-frequency components contain discriminative visual patterns and\nare less affected by variations such as wavelength, pose, and background\nclutter than holistic images. Therefore, we are motivated to bridge the\nmodality gap based on such high-frequency components, and propose\n\\textbf{Proto}type-guided \\textbf{H}igh-frequency \\textbf{P}atch\n\\textbf{E}nhancement (ProtoHPE) with two core designs. \\textbf{First}, to\nenhance the representation ability of cross-modal correlated high-frequency\ncomponents, we split patches with such components by Wavelet Transform and\nexponential moving average Vision Transformer (ViT), then empower ViT to take\nthe split patches as auxiliary input. \\textbf{Second}, to obtain semantically\ncompact and discriminative high-frequency representations of the same identity,\nwe propose Multimodal Prototypical Contrast. To be specific, it hierarchically\ncaptures the comprehensive semantics of different modal instances, facilitating\nthe aggregation of high-frequency representations belonging to the same\nidentity. 
With it, ViT can capture key high-frequency components during\ninference without relying on ProtoHPE, thus bringing no extra complexity.\nExtensive experiments validate the effectiveness of ProtoHPE.\n","authors":["Guiwei Zhang","Yongfei Zhang","Zichang Tan"],"pdf_url":"https://arxiv.org/pdf/2310.07552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07548v1","updated":"2023-10-11T14:50:52Z","published":"2023-10-11T14:50:52Z","title":"Attribute Localization and Revision Network for Zero-Shot Learning","summary":" Zero-shot learning enables the model to recognize unseen categories with the\naid of auxiliary semantic information such as attributes. Current works\npropose to detect attributes from local image regions and align extracted\nfeatures with class-level semantics. In this paper, we find that the choice\nbetween local and global features is not a zero-sum game; global features can\nalso contribute to the understanding of attributes. In addition, aligning\nattribute features with class-level semantics ignores potential intra-class\nattribute variation. To mitigate these disadvantages, we present the Attribute\nLocalization and Revision Network in this paper. First, we design an Attribute\nLocalization Module (ALM) to capture both local and global features from image\nregions; a novel module called the Scale Control Unit is incorporated to fuse\nglobal and local representations. Second, we propose an Attribute Revision Module\n(ARM), which generates image-level semantics by revising the ground-truth value\nof each attribute, compensating for performance degradation caused by ignoring\nintra-class variation. Finally, the output of ALM is aligned with the revised\nsemantics produced by ARM to complete the training process. Comprehensive\nexperimental results on three widely used benchmarks demonstrate the\neffectiveness of our model in the zero-shot prediction task.\n","authors":["Junzhe Xu","Suling Duan","Chenwei Tang","Zhenan He","Jiancheng Lv"],"pdf_url":"https://arxiv.org/pdf/2310.07548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.12067v2","updated":"2023-10-11T14:49:26Z","published":"2023-08-23T11:27:30Z","title":"InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4","summary":" Multimodal large language models are typically trained in two stages: first\npre-training on image-text pairs, and then fine-tuning using supervised\nvision-language instruction data. Recent studies have shown that large language\nmodels can achieve satisfactory results even with a limited amount of\nhigh-quality instruction-following data. In this paper, we introduce\nInstructionGPT-4, which is fine-tuned on a small dataset comprising only 200\nexamples, amounting to approximately 6\% of the instruction-following data used\nin the alignment dataset for MiniGPT-4. To achieve this, we first propose\nseveral metrics to assess the quality of multimodal instruction data. Based on\nthese metrics, we present an effective and trainable data selector to\nautomatically identify and filter low-quality vision-language data. By\nemploying this method, InstructionGPT-4 outperforms the original MiniGPT-4 on\nvarious evaluations. Overall, our findings demonstrate that a smaller amount of\nhigh-quality instruction-tuning data is sufficient to enable multimodal large\nlanguage models to generate better output. 
Our code is available at\nhttps://github.com/waltonfuture/InstructionGPT-4.\n","authors":["Lai Wei","Zihao Jiang","Weiran Huang","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2308.12067v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07534v1","updated":"2023-10-11T14:39:12Z","published":"2023-10-11T14:39:12Z","title":"Human-Centered Evaluation of XAI Methods","summary":" In the ever-evolving field of Artificial Intelligence, a critical challenge\nhas been to decipher the decision-making processes within the so-called \"black\nboxes\" in deep learning. Over recent years, a plethora of methods have emerged,\ndedicated to explaining decisions across diverse tasks. Particularly in tasks\nlike image classification, these methods typically identify and emphasize the\npivotal pixels that most influence a classifier's prediction. Interestingly,\nthis approach mirrors human behavior: when asked to explain our rationale for\nclassifying an image, we often point to the most salient features or aspects.\nCapitalizing on this parallel, our research embarked on a user-centric study.\nWe sought to objectively measure the interpretability of three leading\nexplanation methods: (1) Prototypical Part Network, (2) Occlusion, and (3)\nLayer-wise Relevance Propagation. Intriguingly, our results highlight that\nwhile the regions spotlighted by these methods can vary widely, they all offer\nhumans a nearly equivalent depth of understanding. This enables users to\ndiscern and categorize images efficiently, reinforcing the value of these\nmethods in enhancing AI transparency.\n","authors":["Karam Dawoud","Wojciech Samek","Sebastian Lapuschkin","Sebastian Bosse"],"pdf_url":"https://arxiv.org/pdf/2310.07534v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12813v2","updated":"2023-10-11T14:35:26Z","published":"2023-07-24T14:06:54Z","title":"Described Object Detection: Liberating Object Detection with Flexible\n Expressions","summary":" Detecting objects based on language information is a popular task that\nincludes Open-Vocabulary object Detection (OVD) and Referring Expression\nComprehension (REC). In this paper, we advance them to a more practical setting\ncalled Described Object Detection (DOD) by expanding category names to flexible\nlanguage expressions for OVD and overcoming the limitation of REC only\ngrounding the pre-existing object. We establish the research foundation for DOD\nby constructing a Description Detection Dataset ($D^3$). This dataset features\nflexible language expressions, whether short category names or long\ndescriptions, and annotating all described objects on all images without\nomission. By evaluating previous SOTA methods on $D^3$, we find some\ntroublemakers that fail current REC, OVD, and bi-functional methods. REC\nmethods struggle with confidence scores, rejecting negative instances, and\nmulti-target scenarios, while OVD methods face constraints with long and\ncomplex descriptions. Recent bi-functional methods also do not work well on DOD\ndue to their separated training procedures and inference strategies for REC and\nOVD tasks. 
Building upon the aforementioned findings, we propose a baseline\nthat largely improves REC methods by reconstructing the training data and\nintroducing a binary classification sub-task, outperforming existing methods.\nData and code are available at https://github.com/shikras/d-cube and related\nworks are tracked in\nhttps://github.com/Charles-Xie/awesome-described-object-detection.\n","authors":["Chi Xie","Zhao Zhang","Yixuan Wu","Feng Zhu","Rui Zhao","Shuang Liang"],"pdf_url":"https://arxiv.org/pdf/2307.12813v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2304.01950v2","updated":"2023-10-11T14:21:29Z","published":"2023-04-01T09:16:40Z","title":"MP-FedCL: Multiprototype Federated Contrastive Learning for Edge\n Intelligence","summary":" Federated learning-assisted edge intelligence enables privacy protection in\nmodern intelligent services. However, not independent and identically\ndistributed (non-IID) distribution among edge clients can impair the local\nmodel performance. The existing single prototype-based strategy represents a\nclass by using the mean of the feature space. However, feature spaces are\nusually not clustered, and a single prototype may not represent a class well.\nMotivated by this, this paper proposes a multi-prototype federated contrastive\nlearning approach (MP-FedCL) which demonstrates the effectiveness of using a\nmulti-prototype strategy over a single-prototype under non-IID settings,\nincluding both label and feature skewness. Specifically, a multi-prototype\ncomputation strategy based on \\textit{k-means} is first proposed to capture\ndifferent embedding representations for each class space, using multiple\nprototypes ($k$ centroids) to represent a class in the embedding space. In each\nglobal round, the computed multiple prototypes and their respective model\nparameters are sent to the edge server for aggregation into a global prototype\npool, which is then sent back to all clients to guide their local training.\nFinally, local training for each client minimizes their own supervised learning\ntasks and learns from shared prototypes in the global prototype pool through\nsupervised contrastive learning, which encourages them to learn knowledge\nrelated to their own class from others and reduces the absorption of unrelated\nknowledge in each global iteration. Experimental results on MNIST, Digit-5,\nOffice-10, and DomainNet show that our method outperforms multiple baselines,\nwith an average test accuracy improvement of about 4.6\\% and 10.4\\% under\nfeature and label non-IID distributions, respectively.\n","authors":["Yu Qiao","Md. Shirajum Munir","Apurba Adhikary","Huy Q. Le","Avi Deb Raha","Chaoning Zhang","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2304.01950v2.pdf","comment":"Accepted by IEEE Internet of Things"},{"id":"http://arxiv.org/abs/2310.07522v1","updated":"2023-10-11T14:19:05Z","published":"2023-10-11T14:19:05Z","title":"S4C: Self-Supervised Semantic Scene Completion with Neural Fields","summary":" 3D semantic scene understanding is a fundamental challenge in computer\nvision. It enables mobile agents to autonomously plan and navigate arbitrary\nenvironments. SSC formalizes this challenge as jointly estimating dense\ngeometry and semantic information from sparse observations of a scene. Current\nmethods for SSC are generally trained on 3D ground truth based on aggregated\nLiDAR scans. This process relies on special sensors and annotation by hand\nwhich are costly and do not scale well. 
To overcome this issue, our work\npresents the first self-supervised approach to SSC called S4C that does not\nrely on 3D ground truth data. Our proposed method can reconstruct a scene from\na single image and only relies on videos and pseudo segmentation ground truth\ngenerated from off-the-shelf image segmentation network during training. Unlike\nexisting methods, which use discrete voxel grids, we represent scenes as\nimplicit semantic fields. This formulation allows querying any point within the\ncamera frustum for occupancy and semantic class. Our architecture is trained\nthrough rendering-based self-supervised losses. Nonetheless, our method\nachieves performance close to fully supervised state-of-the-art methods.\nAdditionally, our method demonstrates strong generalization capabilities and\ncan synthesize accurate segmentation maps for far away viewpoints.\n","authors":["Adrian Hayler","Felix Wimbauer","Dominik Muhle","Christian Rupprecht","Daniel Cremers"],"pdf_url":"https://arxiv.org/pdf/2310.07522v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07517v1","updated":"2023-10-11T14:15:25Z","published":"2023-10-11T14:15:25Z","title":"CM-PIE: Cross-modal perception for interactive-enhanced audio-visual\n video parsing","summary":" Audio-visual video parsing is the task of categorizing a video at the segment\nlevel with weak labels, and predicting them as audible or visible events.\nRecent methods for this task leverage the attention mechanism to capture the\nsemantic correlations among the whole video across the audio-visual modalities.\nHowever, these approaches have overlooked the importance of individual segments\nwithin a video and the relationship among them, and tend to rely on a single\nmodality when learning features. In this paper, we propose a novel\ninteractive-enhanced cross-modal perception method~(CM-PIE), which can learn\nfine-grained features by applying a segment-based attention module.\nFurthermore, a cross-modal aggregation block is introduced to jointly optimize\nthe semantic representation of audio and visual signals by enhancing\ninter-modal interactions. The experimental results show that our model offers\nimproved parsing performance on the Look, Listen, and Parse dataset compared to\nother methods.\n","authors":["Yaru Chen","Ruohao Guo","Xubo Liu","Peipei Wu","Guangyao Li","Zhenbo Li","Wenwu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07517v1.pdf","comment":"5 pages, 3 figures, 15 references"},{"id":"http://arxiv.org/abs/2310.07511v1","updated":"2023-10-11T14:07:05Z","published":"2023-10-11T14:07:05Z","title":"A Unified Remote Sensing Anomaly Detector Across Modalities and Scenes\n via Deviation Relationship Learning","summary":" Remote sensing anomaly detector can find the objects deviating from the\nbackground as potential targets. Given the diversity in earth anomaly types, a\nunified anomaly detector across modalities and scenes should be cost-effective\nand flexible to new earth observation sources and anomaly types. However, the\ncurrent anomaly detectors are limited to a single modality and single scene,\nsince they aim to learn the varying background distribution. Motivated by the\nuniversal anomaly deviation pattern, in that anomalies exhibit deviations from\ntheir local context, we exploit this characteristic to build a unified anomaly\ndetector. 
Firstly, we reformulate the anomaly detection task as an undirected\nbilayer graph based on the deviation relationship, where the anomaly score is\nmodeled as the conditional probability, given the pattern of the background and\nnormal objects. The learning objective is then expressed as a conditional\nprobability ranking problem. Furthermore, we design an instantiation of the\nreformulation in the data, architecture, and optimization aspects. Simulated\nspectral and spatial anomalies drive the instantiated architecture. The model\nis optimized directly for the conditional probability ranking. The proposed\nmodel was validated in five modalities including the hyperspectral, visible\nlight, synthetic aperture radar (SAR), infrared and low light to show its\nunified detection ability.\n","authors":["Jingtao Li","Xinyu Wang","Hengwei Zhao","Liangpei Zhang","Yanfei Zhong"],"pdf_url":"https://arxiv.org/pdf/2310.07511v1.pdf","comment":"Journal paper"},{"id":"http://arxiv.org/abs/2310.07510v1","updated":"2023-10-11T14:06:04Z","published":"2023-10-11T14:06:04Z","title":"Heuristic Vision Pre-Training with Self-Supervised and Supervised\n Multi-Task Learning","summary":" To mimic human vision with the way of recognizing the diverse and open world,\nfoundation vision models are much critical. While recent techniques of\nself-supervised learning show the promising potentiality of this mission, we\nargue that signals from labelled data are also important for common-sense\nrecognition, and properly chosen pre-text tasks can facilitate the efficiency\nof vision representation learning. To this end, we propose a novel pre-training\nframework by adopting both self-supervised and supervised visual pre-text tasks\nin a multi-task manner. Specifically, given an image, we take a heuristic way\nby considering its intrinsic style properties, inside objects with their\nlocations and correlations, and how it looks like in 3D space for basic visual\nunderstanding. However, large-scale object bounding boxes and correlations are\nusually hard to achieve. Alternatively, we develop a hybrid method by\nleveraging both multi-label classification and self-supervised learning. On the\none hand, under the multi-label supervision, the pre-trained model can explore\nthe detailed information of an image, e.g., image types, objects, and part of\nsemantic relations. On the other hand, self-supervised learning tasks, with\nrespect to Masked Image Modeling (MIM) and contrastive learning, can help the\nmodel learn pixel details and patch correlations. Results show that our\npre-trained models can deliver results on par with or better than\nstate-of-the-art (SOTA) results on multiple visual tasks. For example, with a\nvanilla Swin-B backbone, we achieve 85.3\\% top-1 accuracy on ImageNet-1K\nclassification, 47.9 box AP on COCO object detection for Mask R-CNN, and 50.6\nmIoU on ADE-20K semantic segmentation when using Upernet. The performance shows\nthe ability of our vision foundation model to serve general purpose vision\ntasks.\n","authors":["Zhiming Qian"],"pdf_url":"https://arxiv.org/pdf/2310.07510v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07506v1","updated":"2023-10-11T14:02:11Z","published":"2023-10-11T14:02:11Z","title":"Leveraging Hierarchical Feature Sharing for Efficient Dataset\n Condensation","summary":" Given a real-world dataset, data condensation (DC) aims to synthesize a\nsignificantly smaller dataset that captures the knowledge of this dataset for\nmodel training with high performance. 
Recent works propose to enhance DC with\ndata parameterization, which condenses data into parameterized data containers\nrather than pixel space. The intuition behind data parameterization is to\nencode shared features of images to avoid additional storage costs. In this\npaper, we recognize that images share common features in a hierarchical way due\nto the inherent hierarchical structure of the classification system, which is\noverlooked by current data parameterization methods. To better align DC with\nthis hierarchical nature and encourage more efficient information sharing\ninside data containers, we propose a novel data parameterization architecture,\nHierarchical Memory Network (HMN). HMN stores condensed data in a three-tier\nstructure, representing the dataset-level, class-level, and instance-level\nfeatures. Another helpful property of the hierarchical architecture is that HMN\nnaturally ensures good independence among images despite achieving information\nsharing. This enables instance-level pruning for HMN to reduce redundant\ninformation, thereby further minimizing redundancy and enhancing performance.\nWe evaluate HMN on four public datasets (SVHN, CIFAR10, CIFAR100, and\nTiny-ImageNet) and compare HMN with eight DC baselines. The evaluation results\nshow that our proposed method outperforms all baselines, even when trained with\na batch-based loss consuming less GPU memory.\n","authors":["Haizhong Zheng","Jiachen Sun","Shutong Wu","Bhavya Kailkhura","Zhuoqing Mao","Chaowei Xiao","Atul Prakash"],"pdf_url":"https://arxiv.org/pdf/2310.07506v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07504v1","updated":"2023-10-11T14:01:36Z","published":"2023-10-11T14:01:36Z","title":"PtychoDV: Vision Transformer-Based Deep Unrolling Network for\n Ptychographic Image Reconstruction","summary":" Ptychography is an imaging technique that captures multiple overlapping\nsnapshots of a sample, illuminated coherently by a moving localized probe. The\nimage recovery from ptychographic data is generally achieved via an iterative\nalgorithm that solves a nonlinear phase-field problem derived from measured\ndiffraction patterns. However, these approaches have high computational cost.\nIn this paper, we introduce PtychoDV, a novel deep model-based network designed\nfor efficient, high-quality ptychographic image reconstruction. PtychoDV\ncomprises a vision transformer that generates an initial image from the set of\nraw measurements, taking into consideration their mutual correlations. This is\nfollowed by a deep unrolling network that refines the initial image using\nlearnable convolutional priors and the ptychography measurement model.\nExperimental results on simulated data demonstrate that PtychoDV is capable of\noutperforming existing deep learning methods for this problem, and\nsignificantly reduces computational cost compared to iterative methodologies,\nwhile maintaining competitive performance.\n","authors":["Weijie Gan","Qiuchen Zhai","Michael Thompson McCann","Cristina Garcia Cardona","Ulugbek S. 
Kamilov","Brendt Wohlberg"],"pdf_url":"https://arxiv.org/pdf/2310.07504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08313v3","updated":"2023-10-11T13:55:29Z","published":"2023-08-16T12:18:27Z","title":"ECPC-IDS: A benchmark endometrial cancer PET/CT image dataset for\n evaluation of semantic segmentation and detection of hypermetabolic regions","summary":" Endometrial cancer is one of the most common tumors in the female\nreproductive system and is the third most common gynecological malignancy that\ncauses death after ovarian and cervical cancer. Early diagnosis can\nsignificantly improve the 5-year survival rate of patients. With the\ndevelopment of artificial intelligence, computer-assisted diagnosis plays an\nincreasingly important role in improving the accuracy and objectivity of\ndiagnosis, as well as reducing the workload of doctors. However, the absence of\npublicly available endometrial cancer image datasets restricts the application\nof computer-assisted diagnostic techniques. In this paper, a publicly available\nEndometrial Cancer PET/CT Image Dataset for Evaluation of Semantic Segmentation\nand Detection of Hypermetabolic Regions (ECPC-IDS) is published. Specifically,\nthe segmentation section includes PET and CT images, with a total of 7159\nimages in multiple formats. In order to prove the effectiveness of segmentation\nmethods on ECPC-IDS, five classical deep learning semantic segmentation methods\nare selected to test the image segmentation task. The object detection section\nalso includes PET and CT images, with a total of 3579 images and XML files with\nannotation information. Six deep learning methods are selected for experiments\non the detection task. This study conducts extensive experiments using deep\nlearning-based semantic segmentation and object detection methods to\ndemonstrate the differences between various methods on ECPC-IDS. As far as we\nknow, this is the first publicly available dataset of endometrial cancer with a\nlarge number of multiple images, including a large amount of information\nrequired for image and target detection. ECPC-IDS can aid researchers in\nexploring new algorithms to enhance computer-assisted technology, benefiting\nboth clinical doctors and patients greatly.\n","authors":["Dechao Tang","Tianming Du","Deguo Ma","Zhiyu Ma","Hongzan Sun","Marcin Grzegorzek","Huiyan Jiang","Chen Li"],"pdf_url":"https://arxiv.org/pdf/2308.08313v3.pdf","comment":"14 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.07492v1","updated":"2023-10-11T13:39:11Z","published":"2023-10-11T13:39:11Z","title":"Boosting Black-box Attack to Deep Neural Networks with Conditional\n Diffusion Models","summary":" Existing black-box attacks have demonstrated promising potential in creating\nadversarial examples (AE) to deceive deep learning models. Most of these\nattacks need to handle a vast optimization space and require a large number of\nqueries, hence exhibiting limited practical impacts in real-world scenarios. In\nthis paper, we propose a novel black-box attack strategy, Conditional Diffusion\nModel Attack (CDMA), to improve the query efficiency of generating AEs under\nquery-limited situations. 
The key insight of CDMA is to formulate the task of\nAE synthesis as a distribution transformation problem, i.e., benign examples\nand their corresponding AEs can be regarded as coming from two distinctive\ndistributions and can transform from each other with a particular converter.\nUnlike the conventional \\textit{query-and-optimization} approach, we generate\neligible AEs with direct conditional transform using the aforementioned data\nconverter, which can significantly reduce the number of queries needed. CDMA\nadopts the conditional Denoising Diffusion Probabilistic Model as the\nconverter, which can learn the transformation from clean samples to AEs, and\nensure the smooth development of perturbed noise resistant to various defense\nstrategies. We demonstrate the effectiveness and efficiency of CDMA by\ncomparing it with nine state-of-the-art black-box attacks across three\nbenchmark datasets. On average, CDMA can reduce the query count to a handful of\ntimes; in most cases, the query count is only ONE. We also show that CDMA can\nobtain $>99\\%$ attack success rate for untarget attacks over all datasets and\ntargeted attack over CIFAR-10 with the noise budget of $\\epsilon=16$.\n","authors":["Renyang Liu","Wei Zhou","Tianwei Zhang","Kangjie Chen","Jun Zhao","Kwok-Yan Lam"],"pdf_url":"https://arxiv.org/pdf/2310.07492v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.11161v2","updated":"2023-10-11T13:29:23Z","published":"2023-04-02T16:03:44Z","title":"altiro3D: Scene representation from single image and novel view\n synthesis","summary":" We introduce altiro3D, a free extended library developed to represent reality\nstarting from a given original RGB image or flat video. It allows to generate a\nlight-field (or Native) image or video and get a realistic 3D experience. To\nsynthesize N-number of virtual images and add them sequentially into a Quilt\ncollage, we apply MiDaS models for the monocular depth estimation, simple\nOpenCV and Telea inpainting techniques to map all pixels, and implement a\n'Fast' algorithm to handle 3D projection camera and scene transformations along\nN-viewpoints. We use the degree of depth to move proportionally the pixels,\nassuming the original image to be at the center of all the viewpoints. altiro3D\ncan also be used with DIBR algorithm to compute intermediate snapshots from a\nequivalent 'Real (slower)' camera with N-geometric viewpoints, which requires\nto calibrate a priori several intrinsic and extrinsic camera parameters. We\nadopt a pixel- and device-based Lookup Table to optimize computing time. The\nmultiple viewpoints and video generated from a single image or frame can be\ndisplayed in a free-view LCD display.\n","authors":["E. Canessa","L. Tenze"],"pdf_url":"https://arxiv.org/pdf/2304.11161v2.pdf","comment":"In press (2023) Springer International Journal of Information\n Technology (IJIT) 10 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.07473v1","updated":"2023-10-11T13:19:29Z","published":"2023-10-11T13:19:29Z","title":"FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation","summary":" Learning to navigate to an image-specified goal is an important but\nchallenging task for autonomous systems. The agent is required to reason the\ngoal location from where a picture is shot. Existing methods try to solve this\nproblem by learning a navigation policy, which captures semantic features of\nthe goal image and observation image independently and lastly fuses them for\npredicting a sequence of navigation actions. 
However, these methods suffer from\ntwo major limitations. 1) They may miss detailed information in the goal image,\nand thus fail to reason the goal location. 2) More critically, it is hard to\nfocus on the goal-relevant regions in the observation image, because they\nattempt to understand observation without goal conditioning. In this paper, we\naim to overcome these limitations by designing a Fine-grained Goal Prompting\n(FGPrompt) method for image-goal navigation. In particular, we leverage\nfine-grained and high-resolution feature maps in the goal image as prompts to\nperform conditioned embedding, which preserves detailed information in the goal\nimage and guides the observation encoder to pay attention to goal-relevant\nregions. Compared with existing methods on the image-goal navigation benchmark,\nour method brings significant performance improvement on 3 benchmark datasets\n(i.e., Gibson, MP3D, and HM3D). Especially on Gibson, we surpass the\nstate-of-the-art success rate by 8% with only 1/50 model size. Project page:\nhttps://xinyusun.github.io/fgprompt-pages\n","authors":["Xinyu Sun","Peihao Chen","Jugang Fan","Thomas H. Li","Jian Chen","Mingkui Tan"],"pdf_url":"https://arxiv.org/pdf/2310.07473v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.07449v1","updated":"2023-10-11T12:51:16Z","published":"2023-10-11T12:51:16Z","title":"PoRF: Pose Residual Field for Accurate Neural Surface Reconstruction","summary":" Neural surface reconstruction is sensitive to the camera pose noise, even if\nstate-of-the-art pose estimators like COLMAP or ARKit are used. More\nimportantly, existing Pose-NeRF joint optimisation methods have struggled to\nimprove pose accuracy in challenging real-world scenarios. To overcome the\nchallenges, we introduce the pose residual field (\\textbf{PoRF}), a novel\nimplicit representation that uses an MLP for regressing pose updates. This is\nmore robust than the conventional pose parameter optimisation due to parameter\nsharing that leverages global information over the entire sequence.\nFurthermore, we propose an epipolar geometry loss to enhance the supervision\nthat leverages the correspondences exported from COLMAP results without the\nextra computational overhead. Our method yields promising results. On the DTU\ndataset, we reduce the rotation error by 78\\% for COLMAP poses, leading to the\ndecreased reconstruction Chamfer distance from 3.48mm to 0.85mm. On the\nMobileBrick dataset that contains casually captured unbounded 360-degree\nvideos, our method refines ARKit poses and improves the reconstruction F1 score\nfrom 69.18 to 75.67, outperforming that with the dataset provided ground-truth\npose (75.14). These achievements demonstrate the efficacy of our approach in\nrefining camera poses and improving the accuracy of neural surface\nreconstruction in real-world scenarios.\n","authors":["Jia-Wang Bian","Wenjing Bian","Victor Adrian Prisacariu","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2310.07449v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.07440v1","updated":"2023-10-11T12:46:11Z","published":"2023-10-11T12:46:11Z","title":"Distance-based Weighted Transformer Network for Image Completion","summary":" The challenge of image generation has been effectively modeled as a problem\nof structure priors or transformation. 
However, existing models have\nunsatisfactory performance in understanding the global input image structures\nbecause of particular inherent features (for example, local inductive prior).\nRecent studies have shown that self-attention is an efficient modeling\ntechnique for image completion problems. In this paper, we propose a new\narchitecture that relies on Distance-based Weighted Transformer (DWT) to better\nunderstand the relationships between an image's components. In our model, we\nleverage the strengths of both Convolutional Neural Networks (CNNs) and DWT\nblocks to enhance the image completion process. Specifically, CNNs are used to\naugment the local texture information of coarse priors and DWT blocks are used\nto recover certain coarse textures and coherent visual structures. Unlike\ncurrent approaches that generally use CNNs to create feature maps, we use the\nDWT to encode global dependencies and compute distance-based weighted feature\nmaps, which substantially minimizes the problem of visual ambiguities.\nMeanwhile, to better produce repeated textures, we introduce Residual Fast\nFourier Convolution (Res-FFC) blocks to combine the encoder's skip features\nwith the coarse features provided by our generator. Furthermore, a simple yet\neffective technique is proposed to normalize the non-zero values of\nconvolutions, and fine-tune the network layers for regularization of the\ngradient norms to provide an efficient training stabiliser. Extensive\nquantitative and qualitative experiments on three challenging datasets\ndemonstrate the superiority of our proposed model compared to existing\napproaches.\n","authors":["Pourya Shamsolmoali","Masoumeh Zareapoor","Huiyu Zhou","Xuelong Li","Yue Lu"],"pdf_url":"https://arxiv.org/pdf/2310.07440v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07438v1","updated":"2023-10-11T12:41:32Z","published":"2023-10-11T12:41:32Z","title":"DESTINE: Dynamic Goal Queries with Temporal Transductive Alignment for\n Trajectory Prediction","summary":" Predicting temporally consistent road users' trajectories in a multi-agent\nsetting is a challenging task due to unknown characteristics of agents and\ntheir varying intentions. Besides using semantic map information and modeling\ninteractions, it is important to build an effective mechanism capable of\nreasoning about behaviors at different levels of granularity. To this end, we\npropose Dynamic goal quErieS with temporal Transductive alIgNmEnt (DESTINE)\nmethod. Unlike past arts, our approach 1) dynamically predicts agents' goals\nirrespective of particular road structures, such as lanes, allowing the method\nto produce a more accurate estimation of destinations; 2) achieves map\ncompliant predictions by generating future trajectories in a coarse-to-fine\nfashion, where the coarser predictions at a lower frame rate serve as\nintermediate goals; and 3) uses an attention module designed to temporally\nalign predicted trajectories via masked attention. 
Using the common Argoverse\nbenchmark dataset, we show that our method achieves state-of-the-art\nperformance on various metrics, and further investigate the contributions of\nproposed modules via comprehensive ablation studies.\n","authors":["Rezaul Karim","Soheil Mohamad Alizadeh Shabestary","Amir Rasouli"],"pdf_url":"https://arxiv.org/pdf/2310.07438v1.pdf","comment":"6 tables 4 figures"},{"id":"http://arxiv.org/abs/2205.09615v4","updated":"2023-10-11T12:09:35Z","published":"2022-05-19T15:13:00Z","title":"EXACT: How to Train Your Accuracy","summary":" Classification tasks are usually evaluated in terms of accuracy. However,\naccuracy is discontinuous and cannot be directly optimized using gradient\nascent. Popular methods minimize cross-entropy, hinge loss, or other surrogate\nlosses, which can lead to suboptimal results. In this paper, we propose a new\noptimization framework by introducing stochasticity to a model's output and\noptimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive\nexperiments on linear models and deep image classification show that the\nproposed optimization method is a powerful alternative to widely used\nclassification losses.\n","authors":["Ivan Karpukhin","Stanislav Dereka","Sergey Kolesnikov"],"pdf_url":"https://arxiv.org/pdf/2205.09615v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07419v1","updated":"2023-10-11T12:05:44Z","published":"2023-10-11T12:05:44Z","title":"Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing\n Else","summary":" Recent advances in text-to-image diffusion models have enabled the\nphotorealistic generation of images from text prompts. Despite the great\nprogress, existing models still struggle to generate compositional\nmulti-concept images naturally, limiting their ability to visualize human\nimagination. While several recent works have attempted to address this issue,\nthey either introduce additional training or adopt guidance at inference time.\nIn this work, we consider a more ambitious goal: natural multi-concept\ngeneration using a pre-trained diffusion model, and with almost no extra cost.\nTo achieve this goal, we identify the limitations in the text embeddings used\nfor the pre-trained text-to-image diffusion models. Specifically, we observe\nconcept dominance and non-localized contribution that severely degrade\nmulti-concept generation performance. We further design a minimal low-cost\nsolution that overcomes the above issues by tweaking (not re-training) the text\nembeddings for more realistic multi-concept text-to-image generation. Our\nCorrection by Similarities method tweaks the embedding of concepts by\ncollecting semantic features from most similar tokens to localize the\ncontribution. To avoid mixing features of concepts, we also apply Cross-Token\nNon-Maximum Suppression, which excludes the overlap of contributions from\ndifferent concepts. 
Experiments show that our approach outperforms previous\nmethods in text-to-image, image manipulation, and personalization tasks,\ndespite not introducing additional training or inference costs to the diffusion\nsteps.\n","authors":["Hazarapet Tunanyan","Dejia Xu","Shant Navasardyan","Zhangyang Wang","Humphrey Shi"],"pdf_url":"https://arxiv.org/pdf/2310.07419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07416v1","updated":"2023-10-11T12:01:52Z","published":"2023-10-11T12:01:52Z","title":"A Novel Voronoi-based Convolutional Neural Network Framework for Pushing\n Person Detection in Crowd Videos","summary":" Analyzing the microscopic dynamics of pushing behavior within crowds can\noffer valuable insights into crowd patterns and interactions. By identifying\ninstances of pushing in crowd videos, a deeper understanding of when, where,\nand why such behavior occurs can be achieved. This knowledge is crucial to\ncreating more effective crowd management strategies, optimizing crowd flow, and\nenhancing overall crowd experiences. However, manually identifying pushing\nbehavior at the microscopic level is challenging, and the existing automatic\napproaches cannot detect such microscopic behavior. Thus, this article\nintroduces a novel automatic framework for identifying pushing in videos of\ncrowds on a microscopic level. The framework comprises two main components: i)\nFeature extraction and ii) Video labeling. In the feature extraction component,\na new Voronoi-based method is developed for determining the local regions\nassociated with each person in the input video. Subsequently, these regions are\nfed into EfficientNetV1B0 Convolutional Neural Network to extract the deep\nfeatures of each person over time. In the second component, a combination of a\nfully connected layer with a Sigmoid activation function is employed to analyze\nthese deep features and annotate the individuals involved in pushing within the\nvideo. The framework is trained and evaluated on a new dataset created using\nsix real-world experiments, including their corresponding ground truths. The\nexperimental findings indicate that the suggested framework outperforms seven\nbaseline methods that are employed for comparative analysis purposes.\n","authors":["Ahmed Alia","Mohammed Maree","Mohcine Chraibi","Armin Seyfried"],"pdf_url":"https://arxiv.org/pdf/2310.07416v1.pdf","comment":"21 pages"},{"id":"http://arxiv.org/abs/2309.14065v4","updated":"2023-10-11T11:43:41Z","published":"2023-09-25T11:57:16Z","title":"AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile\n Platform Real-Time RGB-D Semantic Segmentation","summary":" In the realm of robotic intelligence, achieving efficient and precise RGB-D\nsemantic segmentation is a key cornerstone. State-of-the-art multimodal\nsemantic segmentation methods, primarily rooted in symmetrical skeleton\nnetworks, find it challenging to harmonize computational efficiency and\nprecision. In this work, we propose AsymFormer, a novel network for real-time\nRGB-D semantic segmentation, which targets the minimization of superfluous\nparameters by optimizing the distribution of computational resources and\nintroduces an asymmetrical backbone to allow for the effective fusion of\nmultimodal features. Furthermore, we explore techniques to bolster network\naccuracy by redefining feature selection and extracting multi-modal\nself-similarity features without a substantial increase in the parameter count,\nthereby ensuring real-time execution on robotic platforms. 
Additionally, a\nLocal Attention-Guided Feature Selection (LAFS) module is used to selectively\nfuse features from different modalities by leveraging their dependencies.\nSubsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding\n(CMA) module is introduced to further extract cross-modal representations. This\nmethod is evaluated on NYUv2 and SUNRGBD datasets, with AsymFormer\ndemonstrating competitive results with 52.0% mIoU on NYUv2 and 49.1% mIoU on\nSUNRGBD. Notably, AsymFormer achieves an inference speed of 65 FPS and after\nimplementing mixed precision quantization, it attains an impressive inference\nspeed of 79 FPS on RTX3090. This significantly outperforms existing multi-modal\nmethods, thereby demonstrating that AsymFormer can strike a balance between\nhigh accuracy and efficiency for RGB-D semantic segmentation.\n","authors":["Siqi Du","Weixi Wang","Renzhong Guo","Shengjun Tang"],"pdf_url":"https://arxiv.org/pdf/2309.14065v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.04707v2","updated":"2023-10-11T11:30:32Z","published":"2023-03-08T16:48:24Z","title":"DiM: Distilling Dataset into Generative Model","summary":" Dataset distillation reduces the network training cost by synthesizing small\nand informative datasets from large-scale ones. Despite the success of the\nrecent dataset distillation algorithms, three drawbacks still limit their wider\napplication: i). the synthetic images perform poorly on large architectures;\nii). they need to be re-optimized when the distillation ratio changes; iii).\nthe limited diversity restricts the performance when the distillation ratio is\nlarge. In this paper, we propose a novel distillation scheme to\n\\textbf{D}istill information of large train sets \\textbf{i}nto generative\n\\textbf{M}odels, named DiM. Specifically, DiM learns to use a generative model\nto store the information of the target dataset. During the distillation phase,\nwe minimize the differences in logits predicted by a models pool between real\nand generated images. At the deployment stage, the generative model synthesizes\nvarious training samples from random noises on the fly. Due to the simple yet\neffective designs, the trained DiM can be directly applied to different\ndistillation ratios and large architectures without extra cost. We validate the\nproposed DiM across 4 datasets and achieve state-of-the-art results on all of\nthem. To the best of our knowledge, we are the first to achieve higher accuracy\non complex architectures than simple ones, such as 75.1\\% with ResNet-18 and\n72.6\\% with ConvNet-3 on ten images per class of CIFAR-10. Besides, DiM\noutperforms previous methods with 10\\% $\\sim$ 22\\% when images per class are 1\nand 10 on the SVHN dataset.\n","authors":["Kai Wang","Jianyang Gu","Daquan Zhou","Zheng Zhu","Wei Jiang","Yang You"],"pdf_url":"https://arxiv.org/pdf/2303.04707v2.pdf","comment":"Distilling datasets into generative models"},{"id":"http://arxiv.org/abs/2310.05682v2","updated":"2023-10-11T11:28:40Z","published":"2023-10-09T12:51:46Z","title":"Analysis of Rainfall Variability and Water Extent of Selected Hydropower\n Reservoir Using Google Earth Engine (GEE): A Case Study from Two Tropical\n Countries, Sri Lanka and Vietnam","summary":" This study presents a comprehensive remote sensing analysis of rainfall\npatterns and selected hydropower reservoir water extent in two tropical monsoon\ncountries, Vietnam and Sri Lanka. 
The aim is to understand the relationship\nbetween remotely sensed rainfall data and the dynamic changes (monthly) in\nreservoir water extent. The analysis utilizes high-resolution optical imagery\nand Sentinel-1 Synthetic Aperture Radar (SAR) data to observe and monitor water\nbodies during different weather conditions, especially during the monsoon\nseason. The average annual rainfall for both countries is determined, and\nspatiotemporal variations in monthly average rainfall are examined at regional\nand reservoir basin levels using the Climate Hazards Group InfraRed\nPrecipitation with Station (CHIRPS) dataset from 1981 to 2022. Water extents\nare derived for selected reservoirs using Sentinel-1 SAR Ground Range Detected\n(GRD) images in Vietnam and Sri Lanka from 2017 to 2022. The images are\npre-processed and corrected using terrain correction and a refined Lee filter. An\nautomated thresholding algorithm, OTSU, distinguishes water and land, taking\nadvantage of both VV and VH polarization data. A connected pixel count\nthreshold is applied to enhance result accuracy. The results indicate a clear\nrelationship between rainfall patterns and reservoir water extent, with\nincreased precipitation during the monsoon season leading to higher water\nextents in the later months. This study contributes to understanding how\nrainfall variability impacts reservoir water resources in tropical monsoon\nregions. The preliminary findings can inform water resource management\nstrategies and support these countries' decision-making processes related to\nhydropower generation, flood management, and irrigation.\n","authors":["Punsisi Rajakaruna","Surajit Ghosh","Bunyod Holmatov"],"pdf_url":"https://arxiv.org/pdf/2310.05682v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07394v1","updated":"2023-10-11T11:26:35Z","published":"2023-10-11T11:26:35Z","title":"CLIP for Lightweight Semantic Segmentation","summary":" The large-scale pretrained model CLIP, trained on 400 million image-text\npairs, offers a promising paradigm for tackling vision tasks, albeit at the\nimage level. Later works, such as DenseCLIP and LSeg, extend this paradigm to\ndense prediction, including semantic segmentation, and have achieved excellent\nresults. However, the above methods either rely on CLIP-pretrained visual\nbackbones or use non-pretrained but heavy backbones such as Swin, while\nproving ineffective when applied to lightweight backbones. The reason for this\nis that lightweight networks, whose feature extraction ability is relatively\nlimited, struggle to produce image features that align well with the\ntext embeddings. In this work, we present a new feature fusion module\nwhich tackles this problem and enables the language-guided paradigm to be applied\nto lightweight networks. Specifically, the module is a parallel design of CNN\nand transformer with a two-way bridge in between, where the CNN extracts spatial\ninformation and visual context of the feature map from the image encoder, and\nthe transformer propagates text embeddings from the text encoder forward. The\ncore of the module is the bidirectional fusion of visual and text features\nacross the bridge, which promotes their proximity and alignment in embedding\nspace. 
The module is model-agnostic, which can not only make language-guided\nlightweight semantic segmentation practical, but also fully exploit the\npretrained knowledge of language priors and achieve better performance than\nprevious SOTA work, such as DenseCLIP, whatever the vision backbone is.\nExtensive experiments have been conducted to demonstrate the superiority of our\nmethod.\n","authors":["Ke Jin","Wankou Yang"],"pdf_url":"https://arxiv.org/pdf/2310.07394v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07379v1","updated":"2023-10-11T10:54:44Z","published":"2023-10-11T10:54:44Z","title":"Causal Unsupervised Semantic Segmentation","summary":" Unsupervised semantic segmentation aims to achieve high-quality semantic\ngrouping without human-labeled annotations. With the advent of self-supervised\npre-training, various frameworks utilize the pre-trained features to train\nprediction heads for unsupervised dense prediction. However, a significant\nchallenge in this unsupervised setup is determining the appropriate level of\nclustering required for segmenting concepts. To address it, we propose a novel\nframework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages\ninsights from causal inference. Specifically, we bridge intervention-oriented\napproach (i.e., frontdoor adjustment) to define suitable two-step tasks for\nunsupervised prediction. The first step involves constructing a concept\nclusterbook as a mediator, which represents possible concept prototypes at\ndifferent levels of granularity in a discretized form. Then, the mediator\nestablishes an explicit link to the subsequent concept-wise self-supervised\nlearning for pixel-level grouping. Through extensive experiments and analyses\non various datasets, we corroborate the effectiveness of CAUSE and achieve\nstate-of-the-art performance in unsupervised semantic segmentation.\n","authors":["Junho Kim","Byung-Kwan Lee","Yong Man Ro"],"pdf_url":"https://arxiv.org/pdf/2310.07379v1.pdf","comment":"code available:\n https://github.com/ByungKwanLee/Causal-Unsupervised-Segmentation"},{"id":"http://arxiv.org/abs/2310.07376v1","updated":"2023-10-11T10:50:15Z","published":"2023-10-11T10:50:15Z","title":"Point Cloud Denoising and Outlier Detection with Local Geometric\n Structure by Dynamic Graph CNN","summary":" The digitalization of society is rapidly developing toward the realization of\nthe digital twin and metaverse. In particular, point clouds are attracting\nattention as a media format for 3D space. Point cloud data is contaminated with\nnoise and outliers due to measurement errors. Therefore, denoising and outlier\ndetection are necessary for point cloud processing. Among them, PointCleanNet\nis an effective method for point cloud denoising and outlier detection.\nHowever, it does not consider the local geometric structure of the patch. We\nsolve this problem by applying two types of graph convolutional layer designed\nbased on the Dynamic Graph CNN. 
Experimental results show that the proposed\nmethods outperform the conventional method in AUPR, which indicates outlier\ndetection accuracy, and Chamfer Distance, which indicates denoising accuracy.\n","authors":["Kosuke Nakayama","Hiroto Fukuta","Hiroshi Watanabe"],"pdf_url":"https://arxiv.org/pdf/2310.07376v1.pdf","comment":"2023 IEEE 12th Global Conference on Consumer Electronics (GCCE 2023)"},{"id":"http://arxiv.org/abs/2303.16570v2","updated":"2023-10-11T10:41:11Z","published":"2023-03-29T10:08:29Z","title":"Point2Vec for Self-Supervised Representation Learning on Point Clouds","summary":" Recently, the self-supervised learning framework data2vec has shown inspiring\nperformance for various modalities using a masked student-teacher approach.\nHowever, it remains open whether such a framework generalizes to the unique\nchallenges of 3D point clouds. To answer this question, we extend data2vec to\nthe point cloud domain and report encouraging results on several downstream\ntasks. In an in-depth analysis, we discover that the leakage of positional\ninformation reveals the overall object shape to the student even under heavy\nmasking and thus hampers data2vec to learn strong representations for point\nclouds. We address this 3D-specific shortcoming by proposing point2vec, which\nunleashes the full potential of data2vec-like pre-training on point clouds. Our\nexperiments show that point2vec outperforms other self-supervised methods on\nshape classification and few-shot learning on ModelNet40 and ScanObjectNN,\nwhile achieving competitive results on part segmentation on ShapeNetParts.\nThese results suggest that the learned representations are strong and\ntransferable, highlighting point2vec as a promising direction for\nself-supervised learning of point cloud representations.\n","authors":["Karim Abou Zeid","Jonas Schult","Alexander Hermans","Bastian Leibe"],"pdf_url":"https://arxiv.org/pdf/2303.16570v2.pdf","comment":"Accepted at GCPR 2023. Project page at\n https://vision.rwth-aachen.de/point2vec"},{"id":"http://arxiv.org/abs/2307.00773v3","updated":"2023-10-11T10:29:59Z","published":"2023-07-03T06:33:49Z","title":"DifFSS: Diffusion Model for Few-Shot Semantic Segmentation","summary":" Diffusion models have demonstrated excellent performance in image generation.\nAlthough various few-shot semantic segmentation (FSS) models with different\nnetwork structures have been proposed, performance improvement has reached a\nbottleneck. This paper presents the first work to leverage the diffusion model\nfor FSS task, called DifFSS. DifFSS, a novel FSS paradigm, can further improve\nthe performance of the state-of-the-art FSS models by a large margin without\nmodifying their network structure. Specifically, we utilize the powerful\ngeneration ability of diffusion models to generate diverse auxiliary support\nimages by using the semantic mask, scribble or soft HED boundary of the support\nimage as control conditions. This generation process simulates the variety\nwithin the class of the query image, such as color, texture variation,\nlighting, $etc$. As a result, FSS models can refer to more diverse support\nimages, yielding more robust representations, thereby achieving a consistent\nimprovement in segmentation performance. Extensive experiments on three\npublicly available datasets based on existing advanced FSS models demonstrate\nthe effectiveness of the diffusion model for FSS task. 
Furthermore, we explore\nin detail the impact of different input settings of the diffusion model on\nsegmentation performance. Hopefully, this completely new paradigm will bring\ninspiration to the study of FSS task integrated with AI-generated content. Code\nis available at https://github.com/TrinitialChan/DifFSS\n","authors":["Weimin Tan","Siyuan Chen","Bo Yan"],"pdf_url":"https://arxiv.org/pdf/2307.00773v3.pdf","comment":"code is available at https://github.com/TrinitialChan/DifFSS"},{"id":"http://arxiv.org/abs/2305.03989v2","updated":"2023-10-11T10:26:27Z","published":"2023-05-06T09:29:12Z","title":"LEO: Generative Latent Image Animator for Human Video Synthesis","summary":" Spatio-temporal coherency is a major challenge in synthesizing high quality\nvideos, particularly in synthesizing human videos that contain rich global and\nlocal deformations. To resolve this challenge, previous approaches have\nresorted to different features in the generation process aimed at representing\nappearance and motion. However, in the absence of strict mechanisms to\nguarantee such disentanglement, a separation of motion from appearance has\nremained challenging, resulting in spatial distortions and temporal jittering\nthat break the spatio-temporal coherency. Motivated by this, we here propose\nLEO, a novel framework for human video synthesis, placing emphasis on\nspatio-temporal coherency. Our key idea is to represent motion as a sequence of\nflow maps in the generation process, which inherently isolate motion from\nappearance. We implement this idea via a flow-based image animator and a Latent\nMotion Diffusion Model (LMDM). The former bridges a space of motion codes with\nthe space of flow maps, and synthesizes video frames in a warp-and-inpaint\nmanner. LMDM learns to capture motion prior in the training data by\nsynthesizing sequences of motion codes. Extensive quantitative and qualitative\nanalysis suggests that LEO significantly improves coherent synthesis of human\nvideos over previous methods on the datasets TaichiHD, FaceForensics and\nCelebV-HQ. In addition, the effective disentanglement of appearance and motion\nin LEO allows for two additional tasks, namely infinite-length human video\nsynthesis, as well as content-preserving video editing.\n","authors":["Yaohui Wang","Xin Ma","Xinyuan Chen","Antitza Dantcheva","Bo Dai","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2305.03989v2.pdf","comment":"Project webpage: https://wyhsirius.github.io/LEO-project/"},{"id":"http://arxiv.org/abs/2310.07361v1","updated":"2023-10-11T10:21:34Z","published":"2023-10-11T10:21:34Z","title":"Domain Generalization Guided by Gradient Signal to Noise Ratio of\n Parameters","summary":" Overfitting to the source domain is a common issue in gradient-based training\nof deep neural networks. To compensate for the over-parameterized models,\nnumerous regularization techniques have been introduced such as those based on\ndropout. While these methods achieve significant improvements on classical\nbenchmarks such as ImageNet, their performance diminishes with the introduction\nof domain shift in the test set i.e. when the unseen data comes from a\nsignificantly different distribution. In this paper, we move away from the\nclassical approach of Bernoulli sampled dropout mask construction and propose\nto base the selection on gradient-signal-to-noise ratio (GSNR) of network's\nparameters. Specifically, at each training step, parameters with high GSNR will\nbe discarded. 
Furthermore, we alleviate the burden of manually searching for\nthe optimal dropout ratio by leveraging a meta-learning approach. We evaluate\nour method on standard domain generalization benchmarks and achieve competitive\nresults on classification and face anti-spoofing problems.\n","authors":["Mateusz Michalkiewicz","Masoud Faraki","Xiang Yu","Manmohan Chandraker","Mahsa Baktashmotlagh"],"pdf_url":"https://arxiv.org/pdf/2310.07361v1.pdf","comment":"Paper was accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2310.07359v1","updated":"2023-10-11T10:17:41Z","published":"2023-10-11T10:17:41Z","title":"Diagnosing Bipolar Disorder from 3-D Structural Magnetic Resonance\n Images Using a Hybrid GAN-CNN Method","summary":" Bipolar Disorder (BD) is a psychiatric condition diagnosed by repetitive\ncycles of hypomania and depression. Since diagnosing BD relies on subjective\nbehavioral assessments over a long period, a solid diagnosis based on objective\ncriteria is not straightforward. The current study responded to the described\nobstacle by proposing a hybrid GAN-CNN model to diagnose BD from 3-D structural\nMRI Images (sMRI). The novelty of this study stems from diagnosing BD from sMRI\nsamples rather than conventional datasets such as functional MRI (fMRI),\nelectroencephalography (EEG), and behavioral symptoms while removing the data\ninsufficiency usually encountered when dealing with sMRI samples. The impact of\nvarious augmentation ratios is also tested using 5-fold cross-validation. Based\non the results, this study obtains an accuracy rate of 75.8%, a sensitivity of\n60.3%, and a specificity of 82.5%, which are 3-5% higher than prior work while\nutilizing less than 6% sample counts. Next, it is demonstrated that a 2- D\nlayer-based GAN generator can effectively reproduce complex 3D brain samples, a\nmore straightforward technique than manual image processing. Lastly, the\noptimum augmentation threshold for the current study using 172 sMRI samples is\n50%, showing the applicability of the described method for larger sMRI\ndatasets. In conclusion, it is established that data augmentation using GAN\nimproves the accuracy of the CNN classifier using sMRI samples, thus developing\nmore reliable decision support systems to assist practitioners in identifying\nBD patients more reliably and in a shorter period\n","authors":["Masood Hamed Saghayan","Mohammad Hossein Zolfagharnasab","Ali Khadem","Farzam Matinfar","Hassan Rashidi"],"pdf_url":"https://arxiv.org/pdf/2310.07359v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07355v1","updated":"2023-10-11T10:12:43Z","published":"2023-10-11T10:12:43Z","title":"IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training","summary":" In the field of medical Vision-Language Pre-training (VLP), significant\nefforts have been devoted to deriving text and image features from both\nclinical reports and associated medical images. However, most existing methods\nmay have overlooked the opportunity in leveraging the inherent hierarchical\nstructure of clinical reports, which are generally split into `findings' for\ndescriptive content and `impressions' for conclusive observation. Instead of\nutilizing this rich, structured format, current medical VLP approaches often\nsimplify the report into either a unified entity or fragmented tokens. In this\nwork, we propose a novel clinical prior guided VLP framework named IMITATE to\nlearn the structure information from medical reports with hierarchical\nvision-language alignment. 
The framework derives multi-level visual features\nfrom the chest X-ray (CXR) images and separately aligns these features with the\ndescriptive and the conclusive text encoded in the hierarchical medical report.\nFurthermore, a new clinical-informed contrastive loss is introduced for\ncross-modal learning, which accounts for clinical prior knowledge in\nformulating sample correlations in contrastive learning. The proposed model,\nIMITATE, outperforms baseline VLP methods across six different datasets,\nspanning five medical imaging downstream tasks. Comprehensive experimental\nresults highlight the advantages of integrating the hierarchical structure of\nmedical reports for vision-language alignment.\n","authors":["Che Liu","Sibo Cheng","Miaojing Shi","Anand Shah","Wenjia Bai","Rossella Arcucci"],"pdf_url":"https://arxiv.org/pdf/2310.07355v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2307.00040v2","updated":"2023-10-11T10:11:36Z","published":"2023-06-30T17:37:48Z","title":"DisCo: Disentangled Control for Realistic Human Dance Generation","summary":" Generative AI has made significant strides in computer vision, particularly\nin text-driven image/video synthesis (T2I/T2V). Despite the notable\nadvancements, it remains challenging in human-centric content synthesis such as\nrealistic dance generation. Current methodologies, primarily tailored for human\nmotion transfer, encounter difficulties when confronted with real-world dance\nscenarios (e.g., social media dance) which require to generalize across a wide\nspectrum of poses and intricate human details. In this paper, we depart from\nthe traditional paradigm of human motion transfer and emphasize two additional\ncritical attributes for the synthesis of human dance content in social media\ncontexts: (i) Generalizability: the model should be able to generalize beyond\ngeneric human viewpoints as well as unseen human subjects, backgrounds, and\nposes; (ii) Compositionality: it should allow for composition of seen/unseen\nsubjects, backgrounds, and poses from different sources seamlessly. To address\nthese challenges, we introduce DisCo, which includes a novel model architecture\nwith disentangled control to improve the compositionality of dance synthesis,\nand an effective human attribute pre-training for better generalizability to\nunseen humans. Extensive qualitative and quantitative results demonstrate that\nDisCo can generate high-quality human dance images and videos with diverse\nappearances and flexible motions. Code, demo, video and visualization are\navailable at: https://disco-dance.github.io/.\n","authors":["Tan Wang","Linjie Li","Kevin Lin","Yuanhao Zhai","Chung-Ching Lin","Zhengyuan Yang","Hanwang Zhang","Zicheng Liu","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2307.00040v2.pdf","comment":"Project Page: https://disco-dance.github.io/ ; Add temporal module ;\n Synchronize FVD computation with MCVD ; More baselines and visualizations"},{"id":"http://arxiv.org/abs/2307.02869v2","updated":"2023-10-11T10:03:08Z","published":"2023-07-06T09:12:13Z","title":"MomentDiff: Generative Video Moment Retrieval from Random to Real","summary":" Video moment retrieval pursues an efficient and generalized solution to\nidentify the specific temporal segments within an untrimmed video that\ncorrespond to a given language description. 
To achieve this goal, we provide a\ngenerative diffusion-based framework called MomentDiff, which simulates a\ntypical human retrieval process from random browsing to gradual localization.\nSpecifically, we first diffuse the real span to random noise, and learn to\ndenoise the random noise to the original span with the guidance of similarity\nbetween text and video. This allows the model to learn a mapping from arbitrary\nrandom locations to real moments, enabling the ability to locate segments from\nrandom initialization. Once trained, MomentDiff could sample random temporal\nsegments as initial guesses and iteratively refine them to generate an accurate\ntemporal boundary. Different from discriminative works (e.g., based on\nlearnable proposals or queries), MomentDiff with random initialized spans could\nresist the temporal location biases from datasets. To evaluate the influence of\nthe temporal location biases, we propose two anti-bias datasets with location\ndistribution shifts, named Charades-STA-Len and Charades-STA-Mom. The\nexperimental results demonstrate that our efficient framework consistently\noutperforms state-of-the-art methods on three public benchmarks, and exhibits\nbetter generalization and robustness on the proposed anti-bias datasets. The\ncode, model, and anti-bias evaluation datasets are available at\nhttps://github.com/IMCCretrieval/MomentDiff.\n","authors":["Pandeng Li","Chen-Wei Xie","Hongtao Xie","Liming Zhao","Lei Zhang","Yun Zheng","Deli Zhao","Yongdong Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.02869v2.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2304.06841v2","updated":"2023-10-11T09:53:51Z","published":"2023-04-13T22:20:54Z","title":"Video alignment using unsupervised learning of local and global features","summary":" In this paper, we tackle the problem of video alignment, the process of\nmatching the frames of a pair of videos containing similar actions. The main\nchallenge in video alignment is that accurate correspondence should be\nestablished despite the differences in the execution processes and appearances\nbetween the two videos. We introduce an unsupervised method for alignment that\nuses global and local features of the frames. In particular, we introduce\neffective features for each video frame using three machine vision tools:\nperson detection, pose estimation, and VGG network. Then, the features are\nprocessed and combined to construct a multidimensional time series that\nrepresents the video. The resulting time series are used to align videos of the\nsame actions using a novel version of dynamic time warping named Diagonalized\nDynamic Time Warping(DDTW). The main advantage of our approach is that no\ntraining is required, which makes it applicable for any new type of action\nwithout any need to collect training samples for it. For evaluation, we\nconsidered video synchronization and phase classification tasks on the Penn\naction dataset. Also, for an effective evaluation of the video synchronization\ntask, we present a new metric called Enclosed Area Error(EAE). 
The results show\nthat our method outperforms previous state-of-the-art methods, such as TCC, and\nother self-supervised and weakly supervised methods.\n","authors":["Niloufar Fakhfour","Mohammad ShahverdiKondori","Hoda Mohammadzade"],"pdf_url":"https://arxiv.org/pdf/2304.06841v2.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.02044v2","updated":"2023-10-11T09:21:23Z","published":"2023-10-03T13:35:49Z","title":"Video Transformers under Occlusion: How Physics and Background\n Attributes Impact Large Models for Robotic Manipulation","summary":" As transformer architectures and dataset sizes continue to scale, the need to\nunderstand the specific dataset factors affecting model performance becomes\nincreasingly urgent. This paper investigates how object physics attributes\n(color, friction coefficient, shape) and background characteristics (static,\ndynamic, background complexity) influence the performance of Video Transformers\nin trajectory prediction tasks under occlusion. Beyond mere occlusion\nchallenges, this study aims to investigate three questions: How do object\nphysics attributes and background characteristics influence the model\nperformance? What kinds of attributes are most influential to the model\ngeneralization? Is there a data saturation point for large transformer model\nperformance within a single task? To facilitate this research, we present\nOccluManip, a real-world video-based robot pushing dataset comprising 460,000\nconsistent recordings of objects with different physics and varying\nbackgrounds. 1.4 TB and in total 1278 hours of high-quality videos of flexible\ntemporal length along with target object trajectories are collected,\naccommodating tasks with different temporal requirements. Additionally, we\npropose Video Occlusion Transformer (VOT), a generic video-transformer-based\nnetwork achieving an average 96% accuracy across all 18 sub-datasets provided\nin OccluManip. OccluManip and VOT will be released at:\nhttps://github.com/ShutongJIN/OccluManip.git\n","authors":["Shutong Jin","Ruiyu Wang","Muhammad Zahid","Florian T. Pokorny"],"pdf_url":"https://arxiv.org/pdf/2310.02044v2.pdf","comment":"Under review at IEEE ICRA 2024"},{"id":"http://arxiv.org/abs/2310.07324v1","updated":"2023-10-11T09:14:30Z","published":"2023-10-11T09:14:30Z","title":"Guided Attention for Interpretable Motion Captioning","summary":" While much effort has been invested in generating human motion from text,\nrelatively few studies have been dedicated to the reverse direction, that is,\ngenerating text from motion. Much of the research focuses on maximizing\ngeneration quality without any regard for the interpretability of the\narchitectures, particularly regarding the influence of particular body parts in\nthe generation and the temporal synchronization of words with specific\nmovements and actions. This study explores the combination of movement encoders\nwith spatio-temporal attention models and proposes strategies to guide the\nattention during training to highlight perceptually pertinent areas of the\nskeleton in time. We show that adding guided attention with adaptive gate leads\nto interpretable captioning while improving performance compared to higher\nparameter-count non-interpretable SOTA systems. On the KIT MLD dataset, we\nobtain a BLEU@4 of 24.4% (SOTA+6%), a ROUGE-L of 58.30% (SOTA +14.1%), a CIDEr\nof 112.10 (SOTA +32.6) and a Bertscore of 41.20% (SOTA +18.20%). 
On HumanML3D,\nwe obtain a BLEU@4 of 25.00 (SOTA +2.7%), a ROUGE-L score of 55.4% (SOTA\n+6.1%), a CIDEr of 61.6 (SOTA -10.9%), a Bertscore of 40.3% (SOTA +2.5%). Our\ncode implementation and reproduction details will be soon available at\nhttps://github.com/rd20karim/M2T-Interpretable/tree/main.\n","authors":["Karim Radouane","Andon Tchechmedjiev","Sylvie Ranwez","Julien Lagarde"],"pdf_url":"https://arxiv.org/pdf/2310.07324v1.pdf","comment":"arXiv preprint"},{"id":"http://arxiv.org/abs/2310.07322v1","updated":"2023-10-11T09:12:42Z","published":"2023-10-11T09:12:42Z","title":"A webcam-based machine learning approach for three-dimensional range of\n motion evaluation","summary":" Background. Joint range of motion (ROM) is an important quantitative measure\nfor physical therapy. Commonly relying on a goniometer, accurate and reliable\nROM measurement requires extensive training and practice. This, in turn,\nimposes a significant barrier for those who have limited in-person access to\nhealthcare.\n Objective. The current study presents and evaluates an alternative machine\nlearning-based ROM evaluation method that could be remotely accessed via a\nwebcam.\n Methods. To evaluate its reliability, the ROM measurements for a diverse set\nof joints (neck, spine, and upper and lower extremities) derived using this\nmethod were compared to those obtained from a marker-based optical motion\ncapture system.\n Results. Data collected from 25 healthy adults demonstrated that the webcam\nsolution exhibited high test-retest reliability, with substantial to almost\nperfect intraclass correlation coefficients for most joints. Compared with the\nmarker-based system, the webcam-based system demonstrated substantial to almost\nperfect inter-rater reliability for some joints, and lower inter-rater\nreliability for other joints (e.g., shoulder flexion and elbow flexion), which\ncould be attributed to the reduced sensitivity to joint locations at the apex\nof the movement.\n Conclusions. The proposed webcam-based method exhibited high test-retest and\ninter-rater reliability, making it a versatile alternative for existing ROM\nevaluation methods in clinical practice and the tele-implementation of physical\ntherapy and rehabilitation.\n","authors":["Xiaoye Michael Wang","Derek T. Smith","Qin Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.07322v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07310v1","updated":"2023-10-11T08:47:29Z","published":"2023-10-11T08:47:29Z","title":"Deep Aramaic: Towards a Synthetic Data Paradigm Enabling Machine\n Learning in Epigraphy","summary":" Epigraphy increasingly turns to modern artificial intelligence (AI)\ntechnologies such as machine learning (ML) for extracting insights from ancient\ninscriptions. However, scarce labeled data for training ML algorithms severely\nlimits current techniques, especially for ancient scripts like Old Aramaic. Our\nresearch pioneers an innovative methodology for generating synthetic training\ndata tailored to Old Aramaic letters. Our pipeline synthesizes photo-realistic\nAramaic letter datasets, incorporating textural features, lighting, damage, and\naugmentations to mimic real-world inscription diversity. Despite minimal real\nexamples, we engineer a dataset of 250,000 training and 25,000 validation\nimages covering the 22 letter classes in the Aramaic alphabet. This\ncomprehensive corpus provides a robust volume of data for training a residual\nneural network (ResNet) to classify highly degraded Aramaic letters. 
The ResNet\nmodel demonstrates high accuracy in classifying real images from the 8th\ncentury BCE Hadad statue inscription. Additional experiments validate\nperformance on varying materials and styles, proving effective generalization.\nOur results validate the model's capabilities in handling diverse real-world\nscenarios, proving the viability of our synthetic data approach and avoiding\nthe dependence on scarce training data that has constrained epigraphic\nanalysis. Our innovative framework elevates interpretation accuracy on damaged\ninscriptions, thus enhancing knowledge extraction from these historical\nresources.\n","authors":["Andrei C. Aioanei","Regine Hunziker-Rodewald","Konstantin Klein","Dominik L. Michels"],"pdf_url":"https://arxiv.org/pdf/2310.07310v1.pdf","comment":"41 pages, 19 images"},{"id":"http://arxiv.org/abs/2206.07255v2","updated":"2023-10-11T08:41:34Z","published":"2022-06-15T02:35:51Z","title":"GRAM-HD: 3D-Consistent Image Generation at High Resolution with\n Generative Radiance Manifolds","summary":" Recent works have shown that 3D-aware GANs trained on unstructured single\nimage collections can generate multiview images of novel instances. The key\nunderpinnings to achieve this are a 3D radiance field generator and a volume\nrendering process. However, existing methods either cannot generate\nhigh-resolution images (e.g., up to 256X256) due to the high computation cost\nof neural volume rendering, or rely on 2D CNNs for image-space upsampling which\njeopardizes the 3D consistency across different views. This paper proposes a\nnovel 3D-aware GAN that can generate high resolution images (up to 1024X1024)\nwhile keeping strict 3D consistency as in volume rendering. Our motivation is\nto achieve super-resolution directly in the 3D space to preserve 3D\nconsistency. We avoid the otherwise prohibitively-expensive computation cost by\napplying 2D convolutions on a set of 2D radiance manifolds defined in the\nrecent generative radiance manifold (GRAM) approach, and apply dedicated loss\nfunctions for effective GAN training at high resolution. Experiments on FFHQ\nand AFHQv2 datasets show that our method can produce high-quality 3D-consistent\nresults that significantly outperform existing methods. It makes a significant\nstep towards closing the gap between traditional 2D image generation and\n3D-consistent free-view generation.\n","authors":["Jianfeng Xiang","Jiaolong Yang","Yu Deng","Xin Tong"],"pdf_url":"https://arxiv.org/pdf/2206.07255v2.pdf","comment":"ICCV2023 camera ready version (more results and method comparisons).\n Project page: https://jeffreyxiang.github.io/GRAM-HD/"},{"id":"http://arxiv.org/abs/2301.11986v2","updated":"2023-10-11T08:25:26Z","published":"2023-01-27T20:54:58Z","title":"Enhancing Face Recognition with Latent Space Data Augmentation and\n Facial Posture Reconstruction","summary":" The small amount of training data for many state-of-the-art deep\nlearning-based Face Recognition (FR) systems causes a marked deterioration in\ntheir performance. Although a considerable amount of research has addressed\nthis issue by inventing new data augmentation techniques, using either input\nspace transformations or Generative Adversarial Networks (GAN) for feature\nspace augmentations, these techniques have yet to satisfy expectations. In this\npaper, we propose an approach named the Face Representation Augmentation (FRA)\nfor augmenting face datasets. 
To the best of our knowledge, FRA is the first\nmethod that shifts its focus towards manipulating the face embeddings generated\nby any face representation learning algorithm to create new embeddings\nrepresenting the same identity and facial emotion but with an altered posture.\nExtensive experiments conducted in this study convince of the efficacy of our\nmethodology and its power to provide noiseless, completely new facial\nrepresentations to improve the training procedure of any FR algorithm.\nTherefore, FRA can help the recent state-of-the-art FR methods by providing\nmore data for training FR systems. The proposed method, using experiments\nconducted on the Karolinska Directed Emotional Faces (KDEF) dataset, improves\nthe identity classification accuracies by 9.52 %, 10.04 %, and 16.60 %, in\ncomparison with the base models of MagFace, ArcFace, and CosFace, respectively.\n","authors":["Soroush Hashemifar","Abdolreza Marefat","Javad Hassannataj Joloudari","Hamid Hassanpour"],"pdf_url":"https://arxiv.org/pdf/2301.11986v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04189v2","updated":"2023-10-11T08:01:11Z","published":"2023-10-06T12:08:15Z","title":"Bridging the Gap between Human Motion and Action Semantics via Kinematic\n Phrases","summary":" The goal of motion understanding is to establish a reliable mapping between\nmotion and action semantics, while it is a challenging many-to-many problem. An\nabstract action semantic (i.e., walk forwards) could be conveyed by\nperceptually diverse motions (walk with arms up or swinging), while a motion\ncould carry different semantics w.r.t. its context and intention. This makes an\nelegant mapping between them difficult. Previous attempts adopted\ndirect-mapping paradigms with limited reliability. Also, current automatic\nmetrics fail to provide reliable assessments of the consistency between motions\nand action semantics. We identify the source of these problems as the\nsignificant gap between the two modalities. To alleviate this gap, we propose\nKinematic Phrases (KP) that take the objective kinematic facts of human motion\nwith proper abstraction, interpretability, and generality characteristics.\nBased on KP as a mediator, we can unify a motion knowledge base and build a\nmotion understanding system. Meanwhile, KP can be automatically converted from\nmotions and to text descriptions with no subjective bias, inspiring Kinematic\nPrompt Generation (KPG) as a novel automatic motion generation benchmark. In\nextensive experiments, our approach shows superiority over other methods. Our\ncode and data would be made publicly available at https://foruck.github.io/KP.\n","authors":["Xinpeng Liu","Yong-Lu Li","Ailing Zeng","Zizheng Zhou","Yang You","Cewu Lu"],"pdf_url":"https://arxiv.org/pdf/2310.04189v2.pdf","comment":"Yong-Lu Li and Cewu Lu are the corresponding authors. 
Project page is\n available at https://foruck.github.io/KP/"},{"id":"http://arxiv.org/abs/2310.07265v1","updated":"2023-10-11T07:45:37Z","published":"2023-10-11T07:45:37Z","title":"Distilling Efficient Vision Transformers from CNNs for Semantic\n Segmentation","summary":" In this paper, we tackle a new problem: how to transfer knowledge from the\npre-trained cumbersome yet well-performed CNN-based model to learn a compact\nVision Transformer (ViT)-based model while maintaining its learning capacity?\nDue to the completely different characteristics of ViT and CNN and the\nlong-existing capacity gap between teacher and student models in Knowledge\nDistillation (KD), directly transferring the cross-model knowledge is\nnon-trivial. To this end, we subtly leverage the visual and\nlinguistic-compatible feature character of ViT (i.e., student), and its\ncapacity gap with the CNN (i.e., teacher) and propose a novel CNN-to-ViT KD\nframework, dubbed C2VKD. Importantly, as the teacher's features are\nheterogeneous to those of the student, we first propose a novel\nvisual-linguistic feature distillation (VLFD) module that explores efficient KD\namong the aligned visual and linguistic-compatible representations. Moreover,\ndue to the large capacity gap between the teacher and student and the\ninevitable prediction errors of the teacher, we then propose a pixel-wise\ndecoupled distillation (PDD) module to supervise the student under the\ncombination of labels and teacher's predictions from the decoupled target and\nnon-target classes. Experiments on three semantic segmentation benchmark\ndatasets consistently show that the increment of mIoU of our method is over\n200% of the SoTA KD methods\n","authors":["Xu Zheng","Yunhao Luo","Pengyuan Zhou","Lin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13720v5","updated":"2023-10-11T07:39:48Z","published":"2023-06-23T18:08:00Z","title":"Decoupled Diffusion Models: Image to Zero and Zero to Noise","summary":" Recent diffusion probabilistic models (DPMs) have shown remarkable abilities\nof generated content, however, they often suffer from complex forward\nprocesses, resulting in inefficient solutions for the reversed process and\nprolonged sampling times. In this paper, we aim to address the aforementioned\nchallenges by focusing on the diffusion process itself that we propose to\ndecouple the intricate diffusion process into two comparatively simpler process\nto improve the generative efficacy and speed. In particular, we present a novel\ndiffusion paradigm named DDM (Decoupled Diffusion Models) based on the Ito\ndiffusion process, in which the image distribution is approximated by an\nexplicit transition probability while the noise path is controlled by the\nstandard Wiener process. We find that decoupling the diffusion process reduces\nthe learning difficulty and the explicit transition probability improves the\ngenerative speed significantly. We prove a new training objective for DPM,\nwhich enables the model to learn to predict the noise and image components\nseparately. Moreover, given the novel forward diffusion equation, we derive the\nreverse denoising formula of DDM that naturally supports fewer steps of\ngeneration without ordinary differential equation (ODE) based accelerators. Our\nexperiments demonstrate that DDM outperforms previous DPMs by a large margin in\nfewer function evaluations setting and gets comparable performances in long\nfunction evaluations setting. 
We also show that our framework can be applied to\nimage-conditioned generation and high-resolution image synthesis, and that it\ncan generate high-quality images with only 10 function evaluations.\n","authors":["Yuhang Huang","Liang Zheng","Zheng Qin","Xinwang Liu","Kai Xu"],"pdf_url":"https://arxiv.org/pdf/2306.13720v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07259v1","updated":"2023-10-11T07:37:13Z","published":"2023-10-11T07:37:13Z","title":"Uncovering Hidden Connections: Iterative Tracking and Reasoning for\n Video-grounded Dialog","summary":" In contrast to conventional visual question answering, video-grounded dialog\nnecessitates a profound understanding of both dialog history and video content\nfor accurate response generation. Despite commendable strides made by existing\nmethodologies, they often grapple with the challenges of incrementally\nunderstanding intricate dialog histories and assimilating video information. In\nresponse to this gap, we present an iterative tracking and reasoning strategy\nthat amalgamates a textual encoder, a visual encoder, and a generator. At its\ncore, our textual encoder is fortified with a path tracking and aggregation\nmechanism, adept at gleaning nuances from dialog history that are pivotal to\ndeciphering the posed questions. Concurrently, our visual encoder harnesses an\niterative reasoning network, meticulously crafted to distill and emphasize\ncritical visual markers from videos, enhancing the depth of visual\ncomprehension. Culminating this enriched information, we employ the pre-trained\nGPT-2 model as our response generator, stitching together coherent and\ncontextually apt answers. Our empirical assessments, conducted on two renowned\ndatasets, testify to the prowess and adaptability of our proposed design.\n","authors":["Haoyu Zhang","Meng Liu","Yaowei Wang","Da Cao","Weili Guan","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2310.07259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07255v1","updated":"2023-10-11T07:30:37Z","published":"2023-10-11T07:30:37Z","title":"ADASR: An Adversarial Auto-Augmentation Framework for Hyperspectral and\n Multispectral Data Fusion","summary":" Deep learning-based hyperspectral image (HSI) super-resolution, which aims to\ngenerate high spatial resolution HSI (HR-HSI) by fusing hyperspectral image\n(HSI) and multispectral image (MSI) with deep neural networks (DNNs), has\nattracted lots of attention. However, neural networks require large amounts of\ntraining data, hindering their application in real-world scenarios. In this\nletter, we propose a novel adversarial automatic data augmentation framework\nADASR that automatically optimizes and augments HSI-MSI sample pairs to enrich\ndata diversity for HSI-MSI fusion. Our framework is sample-aware and optimizes\nan augmentor network and two downsampling networks jointly by adversarial\nlearning so that we can learn more robust downsampling networks for training\nthe upsampling network. Extensive experiments on two public classical\nhyperspectral datasets demonstrate the effectiveness of our ADASR compared to\nthe state-of-the-art methods.\n","authors":["Jinghui Qin","Lihuang Fang","Ruitao Lu","Liang Lin","Yukai Shi"],"pdf_url":"https://arxiv.org/pdf/2310.07255v1.pdf","comment":"This paper has been accepted by IEEE Geoscience and Remote Sensing\n Letters. 
Code is released at https://github.com/fangfang11-plog/ADASR"},{"id":"http://arxiv.org/abs/2310.07252v1","updated":"2023-10-11T07:30:01Z","published":"2023-10-11T07:30:01Z","title":"A Comparative Study of Pre-trained CNNs and GRU-Based Attention for\n Image Caption Generation","summary":" Image captioning is a challenging task involving generating a textual\ndescription for an image using computer vision and natural language processing\ntechniques. This paper proposes a deep neural framework for image caption\ngeneration using a GRU-based attention mechanism. Our approach employs multiple\npre-trained convolutional neural networks as the encoder to extract features\nfrom the image and a GRU-based language model as the decoder to generate\ndescriptive sentences. To improve performance, we integrate the Bahdanau\nattention model with the GRU decoder to enable learning to focus on specific\nimage parts. We evaluate our approach using the MSCOCO and Flickr30k datasets\nand show that it achieves competitive scores compared to state-of-the-art\nmethods. Our proposed framework can bridge the gap between computer vision and\nnatural language and can be extended to specific domains.\n","authors":["Rashid Khan","Bingding Huang","Haseeb Hassan","Asim Zaman","Zhongfu Ye"],"pdf_url":"https://arxiv.org/pdf/2310.07252v1.pdf","comment":"15pages, 10 figures, 5 tables. 2023 the 5th International Conference\n on Robotics and Computer Vision (ICRCV 2023). arXiv admin note: substantial\n text overlap with arXiv:2203.01594"},{"id":"http://arxiv.org/abs/2310.07250v1","updated":"2023-10-11T07:27:28Z","published":"2023-10-11T07:27:28Z","title":"Synthesizing Missing MRI Sequences from Available Modalities using\n Generative Adversarial Networks in BraTS Dataset","summary":" Glioblastoma is a highly aggressive and lethal form of brain cancer. Magnetic\nresonance imaging (MRI) plays a significant role in the diagnosis, treatment\nplanning, and follow-up of glioblastoma patients due to its non-invasive and\nradiation-free nature. The International Brain Tumor Segmentation (BraTS)\nchallenge has contributed to generating numerous AI algorithms to accurately\nand efficiently segment glioblastoma sub-compartments using four structural\n(T1, T1Gd, T2, T2-FLAIR) MRI scans. However, these four MRI sequences may not\nalways be available. To address this issue, Generative Adversarial Networks\n(GANs) can be used to synthesize the missing MRI sequences. In this paper, we\nimplement and utilize an open-source GAN approach that takes any three MRI\nsequences as input to generate the missing fourth structural sequence. Our\nproposed approach is contributed to the community-driven generally nuanced deep\nlearning framework (GaNDLF) and demonstrates promising results in synthesizing\nhigh-quality and realistic MRI sequences, enabling clinicians to improve their\ndiagnostic capabilities and support the application of AI methods to brain\ntumor MRI quantification.\n","authors":["Ibrahim Ethem Hamamci"],"pdf_url":"https://arxiv.org/pdf/2310.07250v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07248v1","updated":"2023-10-11T07:25:50Z","published":"2023-10-11T07:25:50Z","title":"IBoxCLA: Towards Robust Box-supervised Segmentation of Polyp via\n Improved Box-dice and Contrastive Latent-anchors","summary":" Box-supervised polyp segmentation attracts increasing attention for its\ncost-effective potential. 
Existing solutions often rely on learning-free\nmethods or pretrained models to laboriously generate pseudo masks, triggering\nDice constraint subsequently. In this paper, we found that a model guided by\nthe simplest box-filled masks can accurately predict polyp locations/sizes, but\nsuffers from shape collapsing. In response, we propose two innovative learning\nfashions, Improved Box-dice (IBox) and Contrastive Latent-Anchors (CLA), and\ncombine them to train a robust box-supervised model IBoxCLA. The core idea\nbehind IBoxCLA is to decouple the learning of location/size and shape, allowing\nfor focused constraints on each of them. Specifically, IBox transforms the\nsegmentation map into a proxy map using shape decoupling and confusion-region\nswapping sequentially. Within the proxy map, shapes are disentangled, while\nlocations/sizes are encoded as box-like responses. By constraining the proxy\nmap instead of the raw prediction, the box-filled mask can well supervise\nIBoxCLA without misleading its shape learning. Furthermore, CLA contributes to\nshape learning by generating two types of latent anchors, which are learned and\nupdated using momentum and segmented polyps to steadily represent polyp and\nbackground features. The latent anchors facilitate IBoxCLA to capture\ndiscriminative features within and outside boxes in a contrastive manner,\nyielding clearer boundaries. We benchmark IBoxCLA on five public polyp\ndatasets. The experimental results demonstrate the competitive performance of\nIBoxCLA compared to recent fully-supervised polyp segmentation methods, and its\nsuperiority over other box-supervised state-of-the-arts with a relative\nincrease of overall mDice and mIoU by at least 6.5% and 7.5%, respectively.\n","authors":["Zhiwei Wang","Qiang Hu","Hongkuan Shi","Li He","Man He","Wenxuan Dai","Ting Li","Yitong Zhang","Dun Li","Mei Liu","Qiang Li"],"pdf_url":"https://arxiv.org/pdf/2310.07248v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07247v1","updated":"2023-10-11T07:24:27Z","published":"2023-10-11T07:24:27Z","title":"Optimizing the Placement of Roadside LiDARs for Autonomous Driving","summary":" Multi-agent cooperative perception is an increasingly popular topic in the\nfield of autonomous driving, where roadside LiDARs play an essential role.\nHowever, how to optimize the placement of roadside LiDARs is a crucial but\noften overlooked problem. This paper proposes an approach to optimize the\nplacement of roadside LiDARs by selecting optimized positions within the scene\nfor better perception performance. To efficiently obtain the best combination\nof locations, a greedy algorithm based on perceptual gain is proposed, which\nselects the location that can maximize the perceptual gain sequentially. We\ndefine perceptual gain as the increased perceptual capability when a new LiDAR\nis placed. To obtain the perception capability, we propose a perception\npredictor that learns to evaluate LiDAR placement using only a single point\ncloud frame. 
A dataset named Roadside-Opt is created using the CARLA simulator\nto facilitate research on the roadside LiDAR placement problem.\n","authors":["Wentao Jiang","Hao Xiang","Xinyu Cai","Runsheng Xu","Jiaqi Ma","Yikang Li","Gim Hee Lee","Si Liu"],"pdf_url":"https://arxiv.org/pdf/2310.07247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07245v1","updated":"2023-10-11T07:22:37Z","published":"2023-10-11T07:22:37Z","title":"Crowd Counting in Harsh Weather using Image Denoising with Pix2Pix GANs","summary":" Visual crowd counting estimates the density of the crowd using deep learning\nmodels such as convolution neural networks (CNNs). The performance of the model\nheavily relies on the quality of the training data that constitutes crowd\nimages. In harsh weather such as fog, dust, and low light conditions, the\ninference performance may severely degrade on the noisy and blur images. In\nthis paper, we propose the use of Pix2Pix generative adversarial network (GAN)\nto first denoise the crowd images prior to passing them to the counting model.\nA Pix2Pix network is trained using synthetic noisy images generated from\noriginal crowd images and then the pretrained generator is then used in the\ninference engine to estimate the crowd density in unseen, noisy crowd images.\nThe performance is tested on JHU-Crowd dataset to validate the significance of\nthe proposed method particularly when high reliability and accuracy are\nrequired.\n","authors":["Muhammad Asif Khan","Hamid Menouar","Ridha Hamila"],"pdf_url":"https://arxiv.org/pdf/2310.07245v1.pdf","comment":"The paper has been accepted for presentation in IEEE 38th\n International Conference on Image and Vision Computing New Zealand (IVCNZ\n 2023). The final manuscript can be accessed at ieeexplore"},{"id":"http://arxiv.org/abs/2310.04991v3","updated":"2023-10-11T07:20:32Z","published":"2023-10-08T03:35:27Z","title":"Video-Teller: Enhancing Cross-Modal Generation with Fusion and\n Decoupling","summary":" This paper proposes Video-Teller, a video-language foundation model that\nleverages multi-modal fusion and fine-grained modality alignment to\nsignificantly enhance the video-to-text generation task. Video-Teller boosts\nthe training efficiency by utilizing frozen pretrained vision and language\nmodules. It capitalizes on the robust linguistic capabilities of large language\nmodels, enabling the generation of both concise and elaborate video\ndescriptions. To effectively integrate visual and auditory information,\nVideo-Teller builds upon the image-based BLIP-2 model and introduces a cascaded\nQ-Former which fuses information across frames and ASR texts. To better guide\nvideo summarization, we introduce a fine-grained modality alignment objective,\nwhere the cascaded Q-Former's output embedding is trained to align with the\ncaption/summary embedding created by a pretrained text auto-encoder.\nExperimental results demonstrate the efficacy of our proposed video-language\nfoundation model in accurately comprehending videos and generating coherent and\nprecise language descriptions. 
It is worth noting that the fine-grained\nalignment enhances the model's capabilities (4% improvement of CIDEr score on\nMSR-VTT) with only 13% extra parameters in training and zero additional cost in\ninference.\n","authors":["Haogeng Liu","Qihang Fan","Tingkai Liu","Linjie Yang","Yunzhe Tao","Huaibo Huang","Ran He","Hongxia Yang"],"pdf_url":"https://arxiv.org/pdf/2310.04991v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05447v2","updated":"2023-10-11T07:10:49Z","published":"2023-10-09T06:43:48Z","title":"Towards Fair and Comprehensive Comparisons for Image-Based 3D Object\n Detection","summary":" In this work, we build a modular-designed codebase, formulate strong training\nrecipes, design an error diagnosis toolbox, and discuss current methods for\nimage-based 3D object detection. In particular, different from other highly\nmature tasks, e.g., 2D object detection, the community of image-based 3D object\ndetection is still evolving, where methods often adopt different training\nrecipes and tricks resulting in unfair evaluations and comparisons. What is\nworse, these tricks may overwhelm their proposed designs in performance, even\nleading to wrong conclusions. To address this issue, we build a module-designed\ncodebase and formulate unified training standards for the community.\nFurthermore, we also design an error diagnosis toolbox to measure the detailed\ncharacterization of detection models. Using these tools, we analyze current\nmethods in-depth under varying settings and provide discussions for some open\nquestions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes\ndatasets, which have led to different dominant methods for these datasets. We\nhope that this work will facilitate future research in image-based 3D object\ndetection. Our codes will be released at\n\\url{https://github.com/OpenGVLab/3dodi}\n","authors":["Xinzhu Ma","Yongtao Wang","Yinmin Zhang","Zhiyi Xia","Yuan Meng","Zhihui Wang","Haojie Li","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.05447v2.pdf","comment":"ICCV23, code will be released soon"},{"id":"http://arxiv.org/abs/2301.07807v3","updated":"2023-10-11T07:07:18Z","published":"2023-01-18T22:38:03Z","title":"Measuring uncertainty in human visual segmentation","summary":" Segmenting visual stimuli into distinct groups of features and visual objects\nis central to visual function. Classical psychophysical methods have helped\nuncover many rules of human perceptual segmentation, and recent progress in\nmachine learning has produced successful algorithms. Yet, the computational\nlogic of human segmentation remains unclear, partially because we lack\nwell-controlled paradigms to measure perceptual segmentation maps and compare\nmodels quantitatively. Here we propose a new, integrated approach: given an\nimage, we measure multiple pixel-based same--different judgments and perform\nmodel--based reconstruction of the underlying segmentation map. The\nreconstruction is robust to several experimental manipulations and captures the\nvariability of individual participants. We demonstrate the validity of the\napproach on human segmentation of natural images and composite textures. We\nshow that image uncertainty affects measured human variability, and it\ninfluences how participants weigh different visual features. 
Because any\nputative segmentation algorithm can be inserted to perform the reconstruction,\nour paradigm affords quantitative tests of theories of perception as well as\nnew benchmarks for segmentation algorithms.\n","authors":["Jonathan Vacher","Claire Launay","Pascal Mamassian","Ruben Coen-Cagli"],"pdf_url":"https://arxiv.org/pdf/2301.07807v3.pdf","comment":"32 pages, 9 figures, 5 appendix, 5 figures in appendix"},{"id":"http://arxiv.org/abs/2305.17382v3","updated":"2023-10-11T07:02:45Z","published":"2023-05-27T06:24:43Z","title":"APRIL-GAN: A Zero-/Few-Shot Anomaly Classification and Segmentation\n Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on\n Zero-shot AD and 4th Place on Few-shot AD","summary":" In this technical report, we briefly introduce our solution for the\nZero/Few-shot Track of the Visual Anomaly and Novelty Detection (VAND) 2023\nChallenge. For industrial visual inspection, building a single model that can\nbe rapidly adapted to numerous categories without or with only a few normal\nreference images is a promising research direction. This is primarily because\nof the vast variety of the product types. For the zero-shot track, we propose a\nsolution based on the CLIP model by adding extra linear layers. These layers\nare used to map the image features to the joint embedding space, so that they\ncan compare with the text features to generate the anomaly maps. Besides, when\nthe reference images are available, we utilize multiple memory banks to store\ntheir features and compare them with the features of the test images during the\ntesting phase. In this challenge, our method achieved first place in the\nzero-shot track, especially excelling in segmentation with an impressive F1\nscore improvement of 0.0489 over the second-ranked participant. Furthermore, in\nthe few-shot track, we secured the fourth position overall, with our\nclassification F1 score of 0.8687 ranking first among all participating teams.\n","authors":["Xuhai Chen","Yue Han","Jiangning Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.17382v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.16254v3","updated":"2023-10-11T06:59:47Z","published":"2023-03-28T18:59:17Z","title":"CryoFormer: Continuous Heterogeneous Cryo-EM Reconstruction using\n Transformer-based Neural Representations","summary":" Cryo-electron microscopy (cryo-EM) allows for the high-resolution\nreconstruction of 3D structures of proteins and other biomolecules. Successful\nreconstruction of both shape and movement greatly helps understand the\nfundamental processes of life. However, it is still challenging to reconstruct\nthe continuous motions of 3D structures from hundreds of thousands of noisy and\nrandomly oriented 2D cryo-EM images. Recent advancements use Fourier domain\ncoordinate-based neural networks to continuously model 3D conformations, yet\nthey often struggle to capture local flexible regions accurately. We propose\nCryoFormer, a new approach for continuous heterogeneous cryo-EM reconstruction.\nOur approach leverages an implicit feature volume directly in the real domain\nas the 3D representation. We further introduce a novel query-based deformation\ntransformer decoder to improve the reconstruction quality. Our approach is\ncapable of refining pre-computed pose estimations and locating flexible\nregions. In experiments, our method outperforms current approaches on three\npublic datasets (1 synthetic and 2 experimental) and a new synthetic dataset of\nPEDV spike protein. 
The code and new synthetic dataset will be released for\nbetter reproducibility of our results. Project page:\nhttps://cryoformer.github.io.\n","authors":["Xinhang Liu","Yan Zeng","Yifan Qin","Hao Li","Jiakai Zhang","Lan Xu","Jingyi Yu"],"pdf_url":"https://arxiv.org/pdf/2303.16254v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07237v1","updated":"2023-10-11T06:58:22Z","published":"2023-10-11T06:58:22Z","title":"SAGE-ICP: Semantic Information-Assisted ICP","summary":" Robust and accurate pose estimation in unknown environments is an essential\npart of robotic applications. We focus on LiDAR-based point-to-point ICP\ncombined with effective semantic information. This paper proposes a novel\nsemantic information-assisted ICP method named SAGE-ICP, which leverages\nsemantics in odometry. The semantic information for the whole scan is timely\nand efficiently extracted by a 3D convolution network, and these point-wise\nlabels are deeply involved in every part of the registration, including\nsemantic voxel downsampling, data association, adaptive local map, and dynamic\nvehicle removal. Unlike previous semantic-aided approaches, the proposed method\ncan improve localization accuracy in large-scale scenes even if the semantic\ninformation has certain errors. Experimental evaluations on KITTI and KITTI-360\nshow that our method outperforms the baseline methods, and improves accuracy\nwhile maintaining real-time performance, i.e., runs faster than the sensor\nframe rate.\n","authors":["Jiaming Cui","Jiming Chen","Liang Li"],"pdf_url":"https://arxiv.org/pdf/2310.07237v1.pdf","comment":"6+1 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.07236v1","updated":"2023-10-11T06:56:08Z","published":"2023-10-11T06:56:08Z","title":"AdaMesh: Personalized Facial Expressions and Head Poses for\n Speech-Driven 3D Facial Animation","summary":" Speech-driven 3D facial animation aims at generating facial movements that\nare synchronized with the driving speech, which has been widely explored\nrecently. Existing works mostly neglect the person-specific talking style in\ngeneration, including facial expression and head pose styles. Several works\nintend to capture the personalities by fine-tuning modules. However, limited\ntraining data leads to the lack of vividness. In this work, we propose AdaMesh,\na novel adaptive speech-driven facial animation approach, which learns the\npersonalized talking style from a reference video of about 10 seconds and\ngenerates vivid facial expressions and head poses. Specifically, we propose\nmixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter,\nwhich efficiently captures the facial expression style. For the personalized\npose style, we propose a pose adapter by building a discrete pose prior and\nretrieving the appropriate style embedding with a semantic-aware pose style\nmatrix without fine-tuning. Extensive experimental results show that our\napproach outperforms state-of-the-art methods, preserves the talking style in\nthe reference video, and generates vivid facial animation. 
The supplementary\nvideo and code will be available at https://adamesh.github.io.\n","authors":["Liyang Chen","Weihong Bao","Shun Lei","Boshi Tang","Zhiyong Wu","Shiyin Kang","Haozhi Huang"],"pdf_url":"https://arxiv.org/pdf/2310.07236v1.pdf","comment":"Project Page: https://adamesh.github.io"},{"id":"http://arxiv.org/abs/2309.13438v3","updated":"2023-10-11T06:43:08Z","published":"2023-09-23T17:29:38Z","title":"Rethinking Superpixel Segmentation from Biologically Inspired Mechanisms","summary":" Recently, advancements in deep learning-based superpixel segmentation methods\nhave brought about improvements in both the efficiency and the performance of\nsegmentation. However, a significant challenge remains in generating\nsuperpixels that strictly adhere to object boundaries while conveying rich\nvisual significance, especially when cross-surface color correlations may\ninterfere with objects. Drawing inspiration from neural structure and visual\nmechanisms, we propose a biological network architecture comprising an Enhanced\nScreening Module (ESM) and a novel Boundary-Aware Label (BAL) for superpixel\nsegmentation. The ESM enhances semantic information by simulating the\ninteractive projection mechanisms of the visual cortex. Additionally, the BAL\nemulates the spatial frequency characteristics of visual cortical cells to\nfacilitate the generation of superpixels with strong boundary adherence. We\ndemonstrate the effectiveness of our approach through evaluations on both the\nBSDS500 dataset and the NYUv2 dataset.\n","authors":["Tingyu Zhao","Bo Peng","Yuan Sun","Daipeng Yang","Zhenguang Zhang","Xi Wu"],"pdf_url":"https://arxiv.org/pdf/2309.13438v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13311v2","updated":"2023-10-11T06:28:41Z","published":"2023-05-22T17:59:45Z","title":"VDT: General-purpose Video Diffusion Transformers via Mask Modeling","summary":" This work introduces Video Diffusion Transformer (VDT), which pioneers the\nuse of transformers in diffusion-based video generation. It features\ntransformer blocks with modularized temporal and spatial attention modules to\nleverage the rich spatial-temporal representation inherited in transformers. We\nalso propose a unified spatial-temporal mask modeling mechanism, seamlessly\nintegrated with the model, to cater to diverse video generation scenarios. VDT\noffers several appealing benefits. 1) It excels at capturing temporal\ndependencies to produce temporally consistent video frames and even simulate\nthe physics and dynamics of 3D objects over time. 2) It facilitates flexible\nconditioning information, \\eg, simple concatenation in the token space,\neffectively unifying different token lengths and modalities. 3) Pairing with\nour proposed spatial-temporal mask modeling mechanism, it becomes a\ngeneral-purpose video diffuser for harnessing a range of tasks, including\nunconditional generation, video prediction, interpolation, animation, and\ncompletion, etc. Extensive experiments on these tasks spanning various\nscenarios, including autonomous driving, natural weather, human action, and\nphysics-based simulation, demonstrate the effectiveness of VDT. Additionally,\nwe present comprehensive studies on how \\model handles conditioning information\nwith the mask modeling mechanism, which we believe will benefit future research\nand advance the field. 
Project page: https:VDT-2023.github.io\n","authors":["Haoyu Lu","Guoxing Yang","Nanyi Fei","Yuqi Huo","Zhiwu Lu","Ping Luo","Mingyu Ding"],"pdf_url":"https://arxiv.org/pdf/2305.13311v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07223v1","updated":"2023-10-11T06:13:50Z","published":"2023-10-11T06:13:50Z","title":"Deep Learning for blind spectral unmixing of LULC classes with MODIS\n multispectral time series and ancillary data","summary":" Remotely sensed data are dominated by mixed Land Use and Land Cover (LULC)\ntypes. Spectral unmixing is a technique to extract information from mixed\npixels into their constituent LULC types and corresponding abundance fractions.\nTraditionally, solving this task has relied on either classical methods that\nrequire prior knowledge of endmembers or machine learning methods that avoid\nexplicit endmembers calculation, also known as blind spectral unmixing (BSU).\nMost BSU studies based on Deep Learning (DL) focus on one time-step\nhyperspectral data, yet its acquisition remains quite costly compared with\nmultispectral data. To our knowledge, here we provide the first study on BSU of\nLULC classes using multispectral time series data with DL models. We further\nboost the performance of a Long-Short Term Memory (LSTM)-based model by\nincorporating geographic plus topographic (geo-topographic) and climatic\nancillary information. Our experiments show that combining spectral-temporal\ninput data together with geo-topographic and climatic information substantially\nimproves the abundance estimation of LULC classes in mixed pixels. To carry out\nthis study, we built a new labeled dataset of the region of Andalusia (Spain)\nwith monthly multispectral time series of pixels for the year 2013 from MODIS\nat 460m resolution, for two hierarchical levels of LULC classes, named\nAndalusia MultiSpectral MultiTemporal Unmixing (Andalusia-MSMTU). This dataset\nprovides, at the pixel level, a multispectral time series plus ancillary\ninformation annotated with the abundance of each LULC class inside each pixel.\nThe dataset and code are available to the public.\n","authors":["José Rodríguez-Ortega","Rohaifa Khaldi","Domingo Alcaraz-Segura","Siham Tabik"],"pdf_url":"https://arxiv.org/pdf/2310.07223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07222v1","updated":"2023-10-11T06:11:42Z","published":"2023-10-11T06:11:42Z","title":"Uni-paint: A Unified Framework for Multimodal Image Inpainting with\n Pretrained Diffusion Model","summary":" Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have\ndemonstrated impressive image generation capabilities and have also been\nsuccessfully applied to image inpainting. However, in practice, users often\nrequire more control over the inpainting process beyond textual guidance,\nespecially when they want to composite objects with customized appearance,\ncolor, shape, and layout. Unfortunately, existing diffusion-based inpainting\nmethods are limited to single-modal guidance and require task-specific\ntraining, hindering their cross-modal scalability. To address these\nlimitations, we propose Uni-paint, a unified framework for multimodal\ninpainting that offers various modes of guidance, including unconditional,\ntext-driven, stroke-driven, exemplar-driven inpainting, as well as a\ncombination of these modes. Furthermore, our Uni-paint is based on pretrained\nStable Diffusion and does not require task-specific training on specific\ndatasets, enabling few-shot generalizability to customized images. 
We have\nconducted extensive qualitative and quantitative evaluations that show our\napproach achieves comparable results to existing single-modal methods while\noffering multimodal inpainting capabilities not available in other methods.\nCode will be available at https://github.com/ysy31415/unipaint.\n","authors":["Shiyuan Yang","Xiaodong Chen","Jing Liao"],"pdf_url":"https://arxiv.org/pdf/2310.07222v1.pdf","comment":"Accepted by ACMMM'23"},{"id":"http://arxiv.org/abs/2310.04895v2","updated":"2023-10-11T05:59:53Z","published":"2023-10-07T18:47:17Z","title":"Cell Tracking-by-detection using Elliptical Bounding Boxes","summary":" Cell detection and tracking are paramount for bio-analysis. Recent approaches\nrely on the tracking-by-model evolution paradigm, which usually consists of\ntraining end-to-end deep learning models to detect and track the cells on the\nframes with promising results. However, such methods require extensive amounts\nof annotated data, which is time-consuming to obtain and often requires\nspecialized annotators. This work proposes a new approach based on the\nclassical tracking-by-detection paradigm that alleviates the requirement of\nannotated data. More precisely, it approximates the cell shapes as oriented\nellipses and then uses generic-purpose oriented object detectors to identify\nthe cells in each frame. We then rely on a global data association algorithm\nthat explores temporal cell similarity using probability distance metrics,\nconsidering that the ellipses relate to two-dimensional Gaussian distributions.\nOur results show that our method can achieve detection and tracking results\ncompetitively with state-of-the-art techniques that require considerably more\nextensive data annotation. Our code is available at:\nhttps://github.com/LucasKirsten/Deep-Cell-Tracking-EBB.\n","authors":["Lucas N. Kirsten","Cláudio R. Jung"],"pdf_url":"https://arxiv.org/pdf/2310.04895v2.pdf","comment":"Paper under review on IEEE/ACM Transactions on Computational Biology\n and Bioinformatics"},{"id":"http://arxiv.org/abs/2310.07212v1","updated":"2023-10-11T05:58:14Z","published":"2023-10-11T05:58:14Z","title":"Multi-Task Learning-Enabled Automatic Vessel Draft Reading for\n Intelligent Maritime Surveillance","summary":" The accurate and efficient vessel draft reading (VDR) is an important\ncomponent of intelligent maritime surveillance, which could be exploited to\nassist in judging whether the vessel is normally loaded or overloaded. The\ncomputer vision technique with an excellent price-to-performance ratio has\nbecome a popular medium to estimate vessel draft depth. However, the\ntraditional estimation methods easily suffer from several limitations, such as\nsensitivity to low-quality images, high computational cost, etc. In this work,\nwe propose a multi-task learning-enabled computational method (termed MTL-VDR)\nfor generating highly reliable VDR. In particular, our MTL-VDR mainly consists\nof four components, i.e., draft mark detection, draft scale recognition,\nvessel/water segmentation, and final draft depth estimation. We first construct\na benchmark dataset related to draft mark detection and employ a powerful and\nefficient convolutional neural network to accurately perform the detection\ntask. The multi-task learning method is then proposed for simultaneous draft\nscale recognition and vessel/water segmentation. 
To obtain more robust VDR\nunder complex conditions (e.g., damaged and stained scales, etc.), the accurate\ndraft scales are generated by an automatic correction method, which is\npresented based on the spatial distribution rules of draft scales. Finally, an\nadaptive computational method is exploited to yield an accurate and robust\ndraft depth. Extensive experiments have been implemented on the realistic\ndataset to compare our MTL-VDR with state-of-the-art methods. The results have\ndemonstrated its superior performance in terms of accuracy, robustness, and\nefficiency. The computational speed exceeds 40 FPS, which satisfies the\nrequirements of real-time maritime surveillance to guarantee vessel traffic\nsafety.\n","authors":["Jingxiang Qu","Ryan Wen Liu","Chenjie Zhao","Yu Guo","Sendren Sheng-Dong Xu","Fenghua Zhu","Yisheng Lv"],"pdf_url":"https://arxiv.org/pdf/2310.07212v1.pdf","comment":"12 pages,11 figures, submitted to IEEE T-ITS"},{"id":"http://arxiv.org/abs/2310.07209v1","updated":"2023-10-11T05:49:47Z","published":"2023-10-11T05:49:47Z","title":"Multi-task Explainable Skin Lesion Classification","summary":" Skin cancer is one of the deadliest diseases and has a high mortality rate if\nleft untreated. The diagnosis generally starts with visual screening and is\nfollowed by a biopsy or histopathological examination. Early detection can aid\nin lowering mortality rates. Visual screening can be limited by the experience\nof the doctor. Due to the long tail distribution of dermatological datasets and\nsignificant intra-variability between classes, automatic classification\nutilizing computer-aided methods becomes challenging. In this work, we propose\na multitask few-shot-based approach for skin lesions that generalizes well with\nfew labelled data to address the small sample space challenge. The proposed\napproach comprises a fusion of a segmentation network that acts as an attention\nmodule and classification network. The output of the segmentation network helps\nto focus on the most discriminatory features while making a decision by the\nclassification network. To further enhance the classification performance, we\nhave combined segmentation and classification loss in a weighted manner. We\nhave also included the visualization results that explain the decisions made by\nthe algorithm. Three dermatological datasets are used to evaluate the proposed\nmethod thoroughly. We also conducted cross-database experiments to ensure that\nthe proposed approach is generalizable across similar datasets. Experimental\nresults demonstrate the efficacy of the proposed work.\n","authors":["Mahapara Khurshid","Mayank Vatsa","Richa Singh"],"pdf_url":"https://arxiv.org/pdf/2310.07209v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05341v3","updated":"2023-10-11T05:46:28Z","published":"2023-10-09T01:59:49Z","title":"A Critical Look at Classic Test-Time Adaptation Methods in Semantic\n Segmentation","summary":" Test-time adaptation (TTA) aims to adapt a model, initially trained on\ntraining data, to potential distribution shifts in the test data. Most existing\nTTA studies, however, focus on classification tasks, leaving a notable gap in\nthe exploration of TTA for semantic segmentation. This pronounced emphasis on\nclassification might lead numerous newcomers and engineers to mistakenly assume\nthat classic TTA methods designed for classification can be directly applied to\nsegmentation. Nonetheless, this assumption remains unverified, posing an open\nquestion. 
To address this, we conduct a systematic, empirical study to disclose\nthe unique challenges of segmentation TTA, and to determine whether classic TTA\nstrategies can effectively address this task. Our comprehensive results have\nled to three key observations. First, the classic batch norm updating strategy,\ncommonly used in classification TTA, only brings slight performance\nimprovement, and in some cases it might even adversely affect the results. Even\nwith the application of advanced distribution estimation techniques like batch\nrenormalization, the problem remains unresolved. Second, the teacher-student\nscheme does enhance training stability for segmentation TTA in the presence of\nnoisy pseudo-labels. However, it cannot directly result in performance\nimprovement compared to the original model without TTA. Third, segmentation TTA\nsuffers a severe long-tailed imbalance problem, which is substantially more\ncomplex than that in TTA for classification. This long-tailed challenge\nsignificantly affects segmentation TTA performance, even when the accuracy of\npseudo-labels is high. In light of these observations, we conclude that TTA for\nsegmentation presents significant challenges, and simply using classic TTA\nmethods cannot address this problem well.\n","authors":["Chang'an Yi","Haotian Chen","Yifan Zhang","Yonghui Xu","Lizhen Cui"],"pdf_url":"https://arxiv.org/pdf/2310.05341v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05917v2","updated":"2023-10-11T05:41:16Z","published":"2023-10-09T17:59:12Z","title":"Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic\n Clothing Driven by Sparse RGB-D Input","summary":" Clothing is an important part of human appearance but challenging to model in\nphotorealistic avatars. In this work we present avatars with dynamically moving\nloose clothing that can be faithfully driven by sparse RGB-D inputs as well as\nbody and face motion. We propose a Neural Iterative Closest Point (N-ICP)\nalgorithm that can efficiently track the coarse garment shape given sparse\ndepth input. Given the coarse tracking results, the input RGB-D images are then\nremapped to texel-aligned features, which are fed into the drivable avatar\nmodels to faithfully reconstruct appearance details. We evaluate our method\nagainst recent image-driven synthesis baselines, and conduct a comprehensive\nanalysis of the N-ICP algorithm. We demonstrate that our method can generalize\nto a novel testing environment, while preserving the ability to produce\nhigh-fidelity and faithful clothing dynamics and appearance.\n","authors":["Donglai Xiang","Fabian Prada","Zhe Cao","Kaiwen Guo","Chenglei Wu","Jessica Hodgins","Timur Bagautdinov"],"pdf_url":"https://arxiv.org/pdf/2310.05917v2.pdf","comment":"SIGGRAPH Asia 2023 Conference Paper. Project website:\n https://xiangdonglai.github.io/www-sa23-drivable-clothing/"},{"id":"http://arxiv.org/abs/2310.07206v1","updated":"2023-10-11T05:34:36Z","published":"2023-10-11T05:34:36Z","title":"DeepSimHO: Stable Pose Estimation for Hand-Object Interaction via\n Physics Simulation","summary":" This paper addresses the task of 3D pose estimation for a hand interacting\nwith an object from a single image observation. When modeling hand-object\ninteraction, previous works mainly exploit proximity cues, while overlooking\nthe dynamical nature that the hand must stably grasp the object to counteract\ngravity and thus preventing the object from slipping or falling. 
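The test-time-adaptation study above repeatedly refers to the "classic batch norm updating strategy" from classification TTA. A minimal sketch of that strategy as it is commonly implemented: keep all weights frozen but let BatchNorm layers normalize with the statistics of the current unlabeled test batch. The helper name is ours and this is not the paper's code.

```python
import torch
import torch.nn as nn

def tta_use_batch_stats(model: nn.Module):
    """Classic test-time BN adaptation: every BatchNorm layer normalizes with
    the current test batch's statistics instead of the stored source-domain
    running statistics; no parameters are updated."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()                      # use batch statistics in the forward pass
            m.track_running_stats = False  # do not overwrite source statistics
    return model

# Usage on a toy network and an unlabeled test batch.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8),
                    nn.ReLU(), nn.Conv2d(8, 5, 1))
net = tta_use_batch_stats(net)
with torch.no_grad():
    logits = net(torch.randn(4, 3, 32, 32))   # adapted prediction, no labels needed
```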
These works\nfail to leverage dynamical constraints in the estimation and consequently often\nproduce unstable results. Meanwhile, refining unstable configurations with\nphysics-based reasoning remains challenging, both by the complexity of contact\ndynamics and by the lack of effective and efficient physics inference in the\ndata-driven learning framework. To address both issues, we present DeepSimHO: a\nnovel deep-learning pipeline that combines forward physics simulation and\nbackward gradient approximation with a neural network. Specifically, for an\ninitial hand-object pose estimated by a base network, we forward it to a\nphysics simulator to evaluate its stability. However, due to non-smooth contact\ngeometry and penetration, existing differentiable simulators can not provide\nreliable state gradient. To remedy this, we further introduce a deep network to\nlearn the stability evaluation process from the simulator, while smoothly\napproximating its gradient and thus enabling effective back-propagation.\nExtensive experiments show that our method noticeably improves the stability of\nthe estimation and achieves superior efficiency over test-time optimization.\nThe code is available at https://github.com/rongakowang/DeepSimHO.\n","authors":["Rong Wang","Wei Mao","Hongdong Li"],"pdf_url":"https://arxiv.org/pdf/2310.07206v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.17421v2","updated":"2023-10-11T05:07:37Z","published":"2023-09-29T17:34:51Z","title":"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)","summary":" Large multimodal models (LMMs) extend large language models (LLMs) with\nmulti-sensory skills, such as visual understanding, to achieve stronger generic\nintelligence. In this paper, we analyze the latest model, GPT-4V(ision), to\ndeepen the understanding of LMMs. The analysis focuses on the intriguing tasks\nthat GPT-4V can perform, containing test samples to probe the quality and\ngenericity of GPT-4V's capabilities, its supported inputs and working modes,\nand the effective ways to prompt the model. In our approach to exploring\nGPT-4V, we curate and organize a collection of carefully designed qualitative\nsamples spanning a variety of domains and tasks. Observations from these\nsamples demonstrate that GPT-4V's unprecedented ability in processing\narbitrarily interleaved multimodal inputs and the genericity of its\ncapabilities together make GPT-4V a powerful multimodal generalist system.\nFurthermore, GPT-4V's unique capability of understanding visual markers drawn\non input images can give rise to new human-computer interaction methods such as\nvisual referring prompting. We conclude the report with in-depth discussions on\nthe emerging application scenarios and the future research directions for\nGPT-4V-based systems. We hope that this preliminary exploration will inspire\nfuture research on the next-generation multimodal task formulation, new ways to\nexploit and enhance LMMs to solve real-world problems, and gaining better\nunderstanding of multimodal foundation models. Finally, we acknowledge that the\nmodel under our study is solely the product of OpenAI's innovative work, and\nthey should be fully credited for its development. 
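The DeepSimHO entry above works around unreliable simulator gradients by training a network to imitate the simulator's stability evaluation and back-propagating through that smooth approximation. The sketch below shows only the generic pattern, with a toy stand-in "simulator" and a hypothetical 6-D pose vector; it is not the paper's pipeline.

```python
import torch
import torch.nn as nn

def blackbox_stability(pose: torch.Tensor) -> torch.Tensor:
    """Stand-in for a non-differentiable physics simulator: returns a scalar
    'stability' score per pose. Gradients do NOT flow through this function."""
    with torch.no_grad():
        return torch.exp(-((pose - 0.5) ** 2).sum(dim=-1))

# A small MLP learns to imitate the simulator, giving a smooth differentiable proxy.
surrogate = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for _ in range(500):                      # fit the surrogate on (pose, score) pairs
    poses = torch.rand(128, 6)
    target = blackbox_stability(poses).unsqueeze(-1)
    loss = ((surrogate(poses) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The surrogate's gradient can now steer a pose estimate toward higher stability.
pose = torch.rand(1, 6, requires_grad=True)
surrogate(pose).sum().backward()
print(pose.grad)                          # usable gradient despite the black-box simulator
```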
Please see the GPT-4V\ncontributions paper for the authorship and credit attribution:\nhttps://cdn.openai.com/contributions/gpt-4v.pdf\n","authors":["Zhengyuan Yang","Linjie Li","Kevin Lin","Jianfeng Wang","Chung-Ching Lin","Zicheng Liu","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2309.17421v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07189v1","updated":"2023-10-11T04:38:21Z","published":"2023-10-11T04:38:21Z","title":"SpikePoint: An Efficient Point-based Spiking Neural Network for Event\n Cameras Action Recognition","summary":" Event cameras are bio-inspired sensors that respond to local changes in light\nintensity and feature low latency, high energy efficiency, and high dynamic\nrange. Meanwhile, Spiking Neural Networks (SNNs) have gained significant\nattention due to their remarkable efficiency and fault tolerance. By\nsynergistically harnessing the energy efficiency inherent in event cameras and\nthe spike-based processing capabilities of SNNs, their integration could enable\nultra-low-power application scenarios, such as action recognition tasks.\nHowever, existing approaches often entail converting asynchronous events into\nconventional frames, leading to additional data mapping efforts and a loss of\nsparsity, contradicting the design concept of SNNs and event cameras. To\naddress this challenge, we propose SpikePoint, a novel end-to-end point-based\nSNN architecture. SpikePoint excels at processing sparse event cloud data,\neffectively extracting both global and local features through a singular-stage\nstructure. Leveraging the surrogate training method, SpikePoint achieves high\naccuracy with few parameters and maintains low power consumption, specifically\nemploying the identity mapping feature extractor on diverse datasets.\nSpikePoint achieves state-of-the-art (SOTA) performance on four event-based\naction recognition datasets using only 16 timesteps, surpassing other SNN\nmethods. Moreover, it also achieves SOTA performance across all methods on\nthree datasets, utilizing approximately 0.3\\% of the parameters and 0.5\\% of\npower consumption employed by artificial neural networks (ANNs). These results\nemphasize the significance of Point Cloud and pave the way for many\nultra-low-power event-based data processing applications.\n","authors":["Hongwei Ren","Yue Zhou","Yulong Huang","Haotian Fu","Xiaopeng Lin","Jie Song","Bojun Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.07189v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.07186v1","updated":"2023-10-11T04:25:24Z","published":"2023-10-11T04:25:24Z","title":"Multiview Transformer: Rethinking Spatial Information in Hyperspectral\n Image Classification","summary":" Identifying the land cover category for each pixel in a hyperspectral image\n(HSI) relies on spectral and spatial information. An HSI cuboid with a specific\npatch size is utilized to extract spatial-spectral feature representation for\nthe central pixel. In this article, we investigate that scene-specific but not\nessential correlations may be recorded in an HSI cuboid. This additional\ninformation improves the model performance on existing HSI datasets and makes\nit hard to properly evaluate the ability of a model. We refer to this problem\nas the spatial overfitting issue and utilize strict experimental settings to\navoid it. 
We further propose a multiview transformer for HSI classification,\nwhich consists of multiview principal component analysis (MPCA), spectral\nencoder-decoder (SED), and spatial-pooling tokenization transformer (SPTT).\nMPCA performs dimension reduction on an HSI via constructing spectral multiview\nobservations and applying PCA on each view data to extract low-dimensional view\nrepresentation. The combination of view representations, named multiview\nrepresentation, is the dimension reduction output of the MPCA. To aggregate the\nmultiview information, a fully-convolutional SED with a U-shape in spectral\ndimension is introduced to extract a multiview feature map. SPTT transforms the\nmultiview features into tokens using the spatial-pooling tokenization strategy\nand learns robust and discriminative spatial-spectral features for land cover\nidentification. Classification is conducted with a linear classifier.\nExperiments on three HSI datasets with rigid settings demonstrate the\nsuperiority of the proposed multiview transformer over the state-of-the-art\nmethods.\n","authors":["Jie Zhang","Yongshan Zhang","Yicong Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.07186v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07184v1","updated":"2023-10-11T04:20:32Z","published":"2023-10-11T04:20:32Z","title":"NeuroInspect: Interpretable Neuron-based Debugging Framework through\n Class-conditional Visualizations","summary":" Despite deep learning (DL) has achieved remarkable progress in various\ndomains, the DL models are still prone to making mistakes. This issue\nnecessitates effective debugging tools for DL practitioners to interpret the\ndecision-making process within the networks. However, existing debugging\nmethods often demand extra data or adjustments to the decision process,\nlimiting their applicability. To tackle this problem, we present NeuroInspect,\nan interpretable neuron-based debugging framework with three key stages:\ncounterfactual explanations, feature visualizations, and false correlation\nmitigation. Our debugging framework first pinpoints neurons responsible for\nmistakes in the network and then visualizes features embedded in the neurons to\nbe human-interpretable. To provide these explanations, we introduce\nCLIP-Illusion, a novel feature visualization method that generates images\nrepresenting features conditioned on classes to examine the connection between\nneurons and the decision layer. We alleviate convoluted explanations of the\nconventional visualization approach by employing class information, thereby\nisolating mixed properties. This process offers more human-interpretable\nexplanations for model errors without altering the trained network or requiring\nadditional data. Furthermore, our framework mitigates false correlations\nlearned from a dataset under a stochastic perspective, modifying decisions for\nthe neurons considered as the main causes. We validate the effectiveness of our\nframework by addressing false correlations and improving inferences for classes\nwith the worst performance in real-world settings. Moreover, we demonstrate\nthat NeuroInspect helps debug the mistakes of DL models through evaluation for\nhuman understanding. 
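The hyperspectral entry above reduces dimension with multiview PCA (MPCA): spectral bands are grouped into views and PCA is applied to each view, after which the low-dimensional view representations are concatenated. A small numpy sketch of just that step, assuming contiguous band grouping (the abstract does not say how views are formed); the SED and SPTT stages are not shown.

```python
import numpy as np

def multiview_pca(X, n_views=4, n_components=3):
    """Split the spectral bands of pixels X (n_pixels, n_bands) into contiguous
    views, run PCA independently on each view, and concatenate the projections."""
    outputs = []
    for V in np.array_split(X, n_views, axis=1):
        Vc = V - V.mean(axis=0, keepdims=True)
        # PCA via SVD: rows of Vt are the principal directions of this view.
        _, _, Vt = np.linalg.svd(Vc, full_matrices=False)
        outputs.append(Vc @ Vt[:n_components].T)
    return np.concatenate(outputs, axis=1)    # (n_pixels, n_views * n_components)

hsi_pixels = np.random.rand(1000, 200)        # 1000 pixels, 200 spectral bands
rep = multiview_pca(hsi_pixels)
print(rep.shape)                              # (1000, 12)
```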
The code is openly available at\nhttps://github.com/yeongjoonJu/NeuroInspect.\n","authors":["Yeong-Joon Ju","Ji-Hoon Park","Seong-Whan Lee"],"pdf_url":"https://arxiv.org/pdf/2310.07184v1.pdf","comment":"Summitted to IEEE Transactions on Neural Networks and Learning\n Systems (TNNLS)"},{"id":"http://arxiv.org/abs/2310.07179v1","updated":"2023-10-11T04:05:11Z","published":"2023-10-11T04:05:11Z","title":"rpcPRF: Generalizable MPI Neural Radiance Field for Satellite Camera","summary":" Novel view synthesis of satellite images holds a wide range of practical\napplications. While recent advances in the Neural Radiance Field have\npredominantly targeted pin-hole cameras, and models for satellite cameras often\ndemand sufficient input views. This paper presents rpcPRF, a Multiplane Images\n(MPI) based Planar neural Radiance Field for Rational Polynomial Camera (RPC).\nUnlike coordinate-based neural radiance fields in need of sufficient views of\none scene, our model is applicable to single or few inputs and performs well on\nimages from unseen scenes. To enable generalization across scenes, we propose\nto use reprojection supervision to induce the predicted MPI to learn the\ncorrect geometry between the 3D coordinates and the images. Moreover, we remove\nthe stringent requirement of dense depth supervision from deep\nmultiview-stereo-based methods by introducing rendering techniques of radiance\nfields. rpcPRF combines the superiority of implicit representations and the\nadvantages of the RPC model, to capture the continuous altitude space while\nlearning the 3D structure. Given an RGB image and its corresponding RPC, the\nend-to-end model learns to synthesize the novel view with a new RPC and\nreconstruct the altitude of the scene. When multiple views are provided as\ninputs, rpcPRF exerts extra supervision provided by the extra views. On the TLC\ndataset from ZY-3, and the SatMVS3D dataset with urban scenes from WV-3, rpcPRF\noutperforms state-of-the-art nerf-based methods by a significant margin in\nterms of image fidelity, reconstruction accuracy, and efficiency, for both\nsingle-view and multiview task.\n","authors":["Tongtong Zhang","Yuanxiang Li"],"pdf_url":"https://arxiv.org/pdf/2310.07179v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05136v3","updated":"2023-10-11T04:04:07Z","published":"2023-10-08T12:10:44Z","title":"InstructDET: Diversifying Referring Object Detection with Generalized\n Instructions","summary":" We propose InstructDET, a data-centric method for referring object detection\n(ROD) that localizes target objects based on user instructions. While deriving\nfrom referring expressions (REC), the instructions we leverage are greatly\ndiversified to encompass common user intentions related to object detection.\nFor one image, we produce tremendous instructions that refer to every single\nobject and different combinations of multiple objects. Each instruction and its\ncorresponding object bounding boxes (bbxs) constitute one training data pair.\nIn order to encompass common detection expressions, we involve emerging\nvision-language model (VLM) and large language model (LLM) to generate\ninstructions guided by text prompts and object bbxs, as the generalizations of\nfoundation models are effective to produce human-like expressions (e.g.,\ndescribing object property, category, and relationship). We name our\nconstructed dataset as InDET. It contains images, bbxs and generalized\ninstructions that are from foundation models. 
Our InDET is developed from\nexisting REC datasets and object detection datasets, with the expanding\npotential that any image with object bbxs can be incorporated through using our\nInstructDET method. By using our InDET dataset, we show that a conventional ROD\nmodel surpasses existing methods on standard REC datasets and our InDET test\nset. Our data-centric method InstructDET, with automatic data expansion by\nleveraging foundation models, directs a promising field that ROD can be greatly\ndiversified to execute common object detection instructions.\n","authors":["Ronghao Dang","Jiangyan Feng","Haodong Zhang","Chongjian Ge","Lin Song","Lijun Gong","Chengju Liu","Qijun Chen","Feng Zhu","Rui Zhao","Yibing Song"],"pdf_url":"https://arxiv.org/pdf/2310.05136v3.pdf","comment":"Adjust the subject"},{"id":"http://arxiv.org/abs/2310.07176v1","updated":"2023-10-11T04:00:17Z","published":"2023-10-11T04:00:17Z","title":"Improving mitosis detection on histopathology images using large\n vision-language models","summary":" In certain types of cancerous tissue, mitotic count has been shown to be\nassociated with tumor proliferation, poor prognosis, and therapeutic\nresistance. Due to the high inter-rater variability of mitotic counting by\npathologists, convolutional neural networks (CNNs) have been employed to reduce\nthe subjectivity of mitosis detection in hematoxylin and eosin (H&E)-stained\nwhole slide images. However, most existing models have performance that lags\nbehind expert panel review and only incorporate visual information. In this\nwork, we demonstrate that pre-trained large-scale vision-language models that\nleverage both visual features and natural language improve mitosis detection\naccuracy. We formulate the mitosis detection task as an image captioning task\nand a visual question answering (VQA) task by including metadata such as tumor\nand scanner types as context. The effectiveness of our pipeline is demonstrated\nvia comparison with various baseline models using 9,501 mitotic figures and\n11,051 hard negatives (non-mitotic figures that are difficult to characterize)\nfrom the publicly available Mitosis Domain Generalization Challenge (MIDOG22)\ndataset.\n","authors":["Ruiwen Ding","James Hall","Neil Tenenholtz","Kristen Severson"],"pdf_url":"https://arxiv.org/pdf/2310.07176v1.pdf","comment":"Submitted to IEEE ISBI 2024. Under review"},{"id":"http://arxiv.org/abs/2310.07166v1","updated":"2023-10-11T03:29:13Z","published":"2023-10-11T03:29:13Z","title":"Anchor-based Multi-view Subspace Clustering with Hierarchical Feature\n Descent","summary":" Multi-view clustering has attracted growing attention owing to its\ncapabilities of aggregating information from various sources and its promising\nhorizons in public affairs. Up till now, many advanced approaches have been\nproposed in recent literature. However, there are several ongoing difficulties\nto be tackled. One common dilemma occurs while attempting to align the features\nof different views. We dig out as well as deploy the dependency amongst views\nthrough hierarchical feature descent, which leads to a common latent space(\nSTAGE 1). This latent space, for the first time of its kind, is regarded as a\n'resemblance space', as it reveals certain correlations and dependencies of\ndifferent views. To be exact, the one-hot encoding of a category can also be\nreferred to as a resemblance space in its terminal phase. 
Moreover, due to the\nintrinsic fact that most of the existing multi-view clustering algorithms stem\nfrom k-means clustering and spectral clustering, this results in cubic time\ncomplexity w.r.t. the number of the objects. However, we propose Anchor-based\nMulti-view Subspace Clustering with Hierarchical Feature Descent(MVSC-HFD) to\nfurther reduce the computing complexity to linear time cost through a unified\nsampling strategy in resemblance space( STAGE 2), followed by subspace\nclustering to learn the representation collectively( STAGE 3). Extensive\nexperimental results on public benchmark datasets demonstrate that our proposed\nmodel consistently outperforms the state-of-the-art techniques.\n","authors":["Qiyuan Ou","Siwei Wang","Pei Zhang","Sihang Zhou","En Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.07166v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.17546v2","updated":"2023-10-11T03:19:18Z","published":"2023-03-30T17:13:56Z","title":"PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor","summary":" Generative image editing has recently witnessed extremely fast-paced growth.\nSome works use high-level conditioning such as text, while others use low-level\nconditioning. Nevertheless, most of them lack fine-grained control over the\nproperties of the different objects present in the image, i.e.\\,object-level\nimage editing. In this work, we tackle the task by perceiving the images as an\namalgamation of various objects and aim to control the properties of each\nobject in a fine-grained manner. Out of these properties, we identify structure\nand appearance as the most intuitive to understand and useful for editing\npurposes. We propose \\textbf{PAIR} Diffusion, a generic framework that can\nenable a diffusion model to control the structure and appearance properties of\neach object in the image. We show that having control over the properties of\neach object in an image leads to comprehensive editing capabilities. Our\nframework allows for various object-level editing operations on real images\nsuch as reference image-based appearance editing, free-form shape editing,\nadding objects, and variations. Thanks to our design, we do not require any\ninversion step. Additionally, we propose multimodal classifier-free guidance\nwhich enables editing images using both reference images and text when using\nour approach with foundational diffusion models. We validate the above claims\nby extensively evaluating our framework on both unconditional and foundational\ndiffusion models. Please refer to\nhttps://vidit98.github.io/publication/conference-paper/pair_diff.html for code\nand model release.\n","authors":["Vidit Goel","Elia Peruzzo","Yifan Jiang","Dejia Xu","Xingqian Xu","Nicu Sebe","Trevor Darrell","Zhangyang Wang","Humphrey Shi"],"pdf_url":"https://arxiv.org/pdf/2303.17546v2.pdf","comment":"26 pages and 17 figures"},{"id":"http://arxiv.org/abs/2202.07870v2","updated":"2023-10-11T03:11:14Z","published":"2022-02-16T05:47:31Z","title":"IPD:An Incremental Prototype based DBSCAN for large-scale data with\n cluster representatives","summary":" DBSCAN is a fundamental density-based clustering technique that identifies\nany arbitrary shape of the clusters. However, it becomes infeasible while\nhandling big data. On the other hand, centroid-based clustering is important\nfor detecting patterns in a dataset since unprocessed data points can be\nlabeled to their nearest centroid. However, it can not detect non-spherical\nclusters. 
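The PAIR-Diffusion entry above mentions multimodal classifier-free guidance for editing with both a reference image and a text prompt. The abstract gives no formula, so the sketch below shows one common two-condition guidance composition (in the style popularized by InstructPix2Pix) purely for illustration; the guidance scales, tensor shapes, and function name are assumptions, not the paper's method.

```python
import torch

def multimodal_cfg(eps_uncond, eps_img, eps_both, s_img=1.5, s_txt=7.5):
    """One common way to combine two conditioning signals at sampling time:
    start from the unconditional noise prediction and add separately scaled
    corrections for the image condition and for the text condition.
    eps_img  : prediction conditioned on the reference image only
    eps_both : prediction conditioned on image + text
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_both - eps_img))

# Shapes follow a latent-diffusion noise prediction, e.g. (batch, 4, 64, 64).
e_u, e_i, e_b = (torch.randn(1, 4, 64, 64) for _ in range(3))
guided = multimodal_cfg(e_u, e_i, e_b)
print(guided.shape)
```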
For a large data, it is not feasible to store and compute labels of\nevery samples. These can be done as and when the information is required. The\npurpose can be accomplished when clustering act as a tool to identify cluster\nrepresentatives and query is served by assigning cluster labels of nearest\nrepresentative. In this paper, we propose an Incremental Prototype-based DBSCAN\n(IPD) algorithm which is designed to identify arbitrary-shaped clusters for\nlarge-scale data. Additionally, it chooses a set of representatives for each\ncluster.\n","authors":["Jayasree Saha","Jayanta Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2202.07870v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06925v2","updated":"2023-10-11T02:52:46Z","published":"2023-04-14T05:21:47Z","title":"YOLO-Drone:Airborne real-time detection of dense small objects from\n high-altitude perspective","summary":" Unmanned Aerial Vehicles (UAVs), specifically drones equipped with remote\nsensing object detection technology, have rapidly gained a broad spectrum of\napplications and emerged as one of the primary research focuses in the field of\ncomputer vision. Although UAV remote sensing systems have the ability to detect\nvarious objects, small-scale objects can be challenging to detect reliably due\nto factors such as object size, image degradation, and real-time limitations.\nTo tackle these issues, a real-time object detection algorithm (YOLO-Drone) is\nproposed and applied to two new UAV platforms as well as a specific light\nsource (silicon-based golden LED). YOLO-Drone presents several novelties: 1)\nincluding a new backbone Darknet59; 2) a new complex feature aggregation module\nMSPP-FPN that incorporated one spatial pyramid pooling and three atrous spatial\npyramid pooling modules; 3) and the use of Generalized Intersection over Union\n(GIoU) as the loss function. To evaluate performance, two benchmark datasets,\nUAVDT and VisDrone, along with one homemade dataset acquired at night under\nsilicon-based golden LEDs, are utilized. The experimental results show that, in\nboth UAVDT and VisDrone, the proposed YOLO-Drone outperforms state-of-the-art\n(SOTA) object detection methods by improving the mAP of 10.13% and 8.59%,\nrespectively. With regards to UAVDT, the YOLO-Drone exhibits both high\nreal-time inference speed of 53 FPS and a maximum mAP of 34.04%. Notably,\nYOLO-Drone achieves high performance under the silicon-based golden LEDs, with\na mAP of up to 87.71%, surpassing the performance of YOLO series under ordinary\nlight sources. To conclude, the proposed YOLO-Drone is a highly effective\nsolution for object detection in UAV applications, particularly for night\ndetection tasks where silicon-based golden light LED technology exhibits\nsignificant superiority.\n","authors":["Li Zhu","Jiahui Xiong","Feng Xiong","Hanzheng Hu","Zhengnan Jiang"],"pdf_url":"https://arxiv.org/pdf/2304.06925v2.pdf","comment":"Some contributing authors are not signed"},{"id":"http://arxiv.org/abs/2310.07149v1","updated":"2023-10-11T02:50:16Z","published":"2023-10-11T02:50:16Z","title":"Robust Unsupervised Domain Adaptation by Retaining Confident Entropy via\n Edge Concatenation","summary":" The generalization capability of unsupervised domain adaptation can mitigate\nthe need for extensive pixel-level annotations to train semantic segmentation\nnetworks by training models on synthetic data as a source with\ncomputer-generated annotations. 
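The YOLO-Drone entry above adopts Generalized Intersection over Union (GIoU) as its loss. The standard GIoU definition is compact enough to state directly; a self-contained sketch for axis-aligned boxes follows (the loss used in training would typically be 1 - GIoU), independent of the paper's implementation.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2):
    GIoU = IoU - (enclosing area not covered by the union) / enclosing area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose

print(giou((0, 0, 2, 2), (1, 1, 3, 3)))   # overlapping boxes
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes -> negative value
```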
Entropy-based adversarial networks are proposed\nto improve source domain prediction; however, they disregard significant\nexternal information, such as edges, which have the potential to identify and\ndistinguish various objects within an image accurately. To address this issue,\nwe introduce a novel approach to domain adaptation, leveraging the synergy of\ninternal and external information within entropy-based adversarial networks. In\nthis approach, we enrich the discriminator network with edge-predicted\nprobability values within this innovative framework to enhance the clarity of\nclass boundaries. Furthermore, we devised a probability-sharing network that\nintegrates diverse information for more effective segmentation. Incorporating\nobject edges addresses a pivotal aspect of unsupervised domain adaptation that\nhas frequently been neglected in the past -- the precise delineation of object\nboundaries. Conventional unsupervised domain adaptation methods usually center\naround aligning feature distributions and may not explicitly model object\nboundaries. Our approach effectively bridges this gap by offering clear\nguidance on object boundaries, thereby elevating the quality of domain\nadaptation. Our approach undergoes rigorous evaluation on the established\nunsupervised domain adaptation benchmarks, specifically in adapting SYNTHIA\n$\\rightarrow$ Cityscapes and SYNTHIA $\\rightarrow$ Mapillary. Experimental\nresults show that the proposed model attains better performance than\nstate-of-the-art methods. The superior performance across different\nunsupervised domain adaptation scenarios highlights the versatility and\nrobustness of the proposed method.\n","authors":["Hye-Seong Hong","Abhishek Kumar","Dong-Gyu Lee"],"pdf_url":"https://arxiv.org/pdf/2310.07149v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06282v2","updated":"2023-10-11T02:46:12Z","published":"2023-10-10T03:32:33Z","title":"MuseChat: A Conversational Music Recommendation System for Videos","summary":" We introduce MuseChat, an innovative dialog-based music recommendation\nsystem. This unique platform not only offers interactive user engagement but\nalso suggests music tailored for input videos, so that users can refine and\npersonalize their music selections. In contrast, previous systems predominantly\nemphasized content compatibility, often overlooking the nuances of users'\nindividual preferences. For example, all the datasets only provide basic\nmusic-video pairings or such pairings with textual music descriptions. To\naddress this gap, our research offers three contributions. First, we devise a\nconversation-synthesis method that simulates a two-turn interaction between a\nuser and a recommendation system, which leverages pre-trained music tags and\nartist information. In this interaction, users submit a video to the system,\nwhich then suggests a suitable music piece with a rationale. Afterwards, users\ncommunicate their musical preferences, and the system presents a refined music\nrecommendation with reasoning. Second, we introduce a multi-modal\nrecommendation engine that matches music either by aligning it with visual cues\nfrom the video or by harmonizing visual information, feedback from previously\nrecommended music, and the user's textual input. Third, we bridge music\nrepresentations and textual data with a Large Language Model(Vicuna-7B). This\nalignment equips MuseChat to deliver music recommendations and their underlying\nreasoning in a manner resembling human communication. 
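The domain-adaptation entry above builds on entropy-based adversarial training, where the quantity fed to the discriminator is derived from the per-pixel entropy of the segmentation output, here enriched with edge probabilities. Below is a minimal sketch of only the entropy-map computation such methods start from; the edge branch, probability-sharing network, and discriminator are not shown, and the normalization by log(C) is our choice.

```python
import torch
import torch.nn.functional as F

def entropy_map(logits, eps=1e-8):
    """Per-pixel entropy of a segmentation prediction.
    logits: (B, C, H, W) raw scores -> (B, 1, H, W) entropy map,
    normalized by log(C) so values lie in [0, 1]."""
    p = F.softmax(logits, dim=1)
    ent = -(p * torch.log(p + eps)).sum(dim=1, keepdim=True)
    return ent / torch.log(torch.tensor(float(logits.shape[1])))

seg_logits = torch.randn(2, 19, 64, 128)     # e.g. 19 Cityscapes-style classes
ent = entropy_map(seg_logits)
print(ent.shape, float(ent.mean()))          # high entropy marks uncertain pixels
```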
Our evaluations show that\nMuseChat surpasses existing state-of-the-art models in music retrieval tasks\nand pioneers the integration of the recommendation process within a natural\nlanguage framework.\n","authors":["Zhikang Dong","Bin Chen","Xiulong Liu","Pawel Polak","Peng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.06282v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.15854v2","updated":"2023-10-11T02:34:23Z","published":"2023-08-30T08:40:15Z","title":"Zero-shot Inversion Process for Image Attribute Editing with Diffusion\n Models","summary":" Denoising diffusion models have shown outstanding performance in image\nediting. Existing works tend to use either image-guided methods, which provide\na visual reference but lack control over semantic coherence, or text-guided\nmethods, which ensure faithfulness to text guidance but lack visual quality. To\naddress the problem, we propose the Zero-shot Inversion Process (ZIP), a\nframework that injects a fusion of generated visual reference and text guidance\ninto the semantic latent space of a \\textit{frozen} pre-trained diffusion\nmodel. Only using a tiny neural network, the proposed ZIP produces diverse\ncontent and attributes under the intuitive control of the text prompt.\nMoreover, ZIP shows remarkable robustness for both in-domain and out-of-domain\nattribute manipulation on real images. We perform detailed experiments on\nvarious benchmark datasets. Compared to state-of-the-art methods, ZIP produces\nimages of equivalent quality while providing a realistic editing effect.\n","authors":["Zhanbo Feng","Zenan Ling","Ci Gong","Feng Zhou","Jie Li","Robert C. Qiu"],"pdf_url":"https://arxiv.org/pdf/2308.15854v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07138v1","updated":"2023-10-11T02:23:18Z","published":"2023-10-11T02:23:18Z","title":"Denoising Task Routing for Diffusion Models","summary":" Diffusion models generate highly realistic images through learning a\nmulti-step denoising process, naturally embodying the principles of multi-task\nlearning (MTL). Despite the inherent connection between diffusion models and\nMTL, there remains an unexplored area in designing neural architectures that\nexplicitly incorporate MTL into the framework of diffusion models. In this\npaper, we present Denoising Task Routing (DTR), a simple add-on strategy for\nexisting diffusion model architectures to establish distinct information\npathways for individual tasks within a single architecture by selectively\nactivating subsets of channels in the model. What makes DTR particularly\ncompelling is its seamless integration of prior knowledge of denoising tasks\ninto the framework: (1) Task Affinity: DTR activates similar channels for tasks\nat adjacent timesteps and shifts activated channels as sliding windows through\ntimesteps, capitalizing on the inherent strong affinity between tasks at\nadjacent timesteps. (2) Task Weights: During the early stages (higher\ntimesteps) of the denoising process, DTR assigns a greater number of\ntask-specific channels, leveraging the insight that diffusion models prioritize\nreconstructing global structure and perceptually rich contents in earlier\nstages, and focus on simple noise removal in later stages. Our experiments\ndemonstrate that DTR consistently enhances the performance of diffusion models\nacross various evaluation protocols, all without introducing additional\nparameters. Furthermore, DTR contributes to accelerating convergence during\ntraining. 
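The Denoising Task Routing entry above routes denoising tasks by activating a subset of channels per timestep, with adjacent timesteps sharing most channels (a sliding window) and higher timesteps receiving more task-specific channels. The toy mask below illustrates that behaviour only in spirit; the window fractions, sliding rule, and function name are made-up choices, not the paper's routing scheme.

```python
import torch

def routing_mask(t, T, n_channels, min_frac=0.3, max_frac=0.7):
    """Toy timestep-dependent channel mask: a contiguous window of active
    channels slides with the timestep (adjacent timesteps overlap heavily)
    and widens for higher timesteps (earlier, harder denoising stages)."""
    frac = min_frac + (max_frac - min_frac) * (t / (T - 1))
    width = max(1, int(round(frac * n_channels)))
    start = int(round((t / (T - 1)) * (n_channels - width)))
    mask = torch.zeros(n_channels)
    mask[start:start + width] = 1.0
    return mask

T, C = 1000, 64
for t in (0, 500, 999):
    print(t, int(routing_mask(t, T, C).sum()))   # active channel count grows with t

# Applying the mask to a (B, C, H, W) feature map:
feat = torch.randn(2, C, 16, 16)
gated = feat * routing_mask(750, T, C).view(1, C, 1, 1)
```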
Finally, we show the complementarity between our architectural\napproach and existing MTL optimization techniques, providing a more complete\nview of MTL within the context of diffusion training.\n","authors":["Byeongjun Park","Sangmin Woo","Hyojun Go","Jin-Young Kim","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2310.07138v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07131v1","updated":"2023-10-11T02:08:05Z","published":"2023-10-11T02:08:05Z","title":"Echocardiography video synthesis from end diastolic semantic map via\n diffusion model","summary":" Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated\nsignificant achievements in various image and video generation tasks, including\nthe domain of medical imaging. However, generating echocardiography videos\nbased on semantic anatomical information remains an unexplored area of\nresearch. This is mostly due to the constraints imposed by the currently\navailable datasets, which lack sufficient scale and comprehensive frame-wise\nannotations for every cardiac cycle. This paper aims to tackle the\naforementioned challenges by expanding upon existing video diffusion models for\nthe purpose of cardiac video synthesis. More specifically, our focus lies in\ngenerating video using semantic maps of the initial frame during the cardiac\ncycle, commonly referred to as end diastole. To further improve the synthesis\nprocess, we integrate spatial adaptive normalization into multiscale feature\nmaps. This enables the inclusion of semantic guidance during synthesis,\nresulting in enhanced realism and coherence of the resultant video sequences.\nExperiments are conducted on the CAMUS dataset, which is a highly used dataset\nin the field of echocardiography. Our model exhibits better performance\ncompared to the standard diffusion technique in terms of multiple metrics,\nincluding FID, FVD, and SSMI.\n","authors":["Phi Nguyen Van","Duc Tran Minh","Hieu Pham Huy","Long Tran Quoc"],"pdf_url":"https://arxiv.org/pdf/2310.07131v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01636v2","updated":"2023-10-11T02:02:48Z","published":"2023-10-02T21:02:23Z","title":"Adaptive Visual Scene Understanding: Incremental Scene Graph Generation","summary":" Scene graph generation (SGG) involves analyzing images to extract meaningful\ninformation about objects and their relationships. Given the dynamic nature of\nthe visual world, it becomes crucial for AI systems to detect new objects and\nestablish their new relationships with existing objects. To address the lack of\ncontinual learning methodologies in SGG, we introduce the comprehensive\nContinual ScenE Graph Generation (CSEGG) dataset along with 3 learning\nscenarios and 8 evaluation metrics. Our research investigates the continual\nlearning performances of existing SGG methods on the retention of previous\nobject entities and relationships as they learn new ones. Moreover, we also\nexplore how continual object detection enhances generalization in classifying\nknown relationships on unknown objects. We conduct extensive experiments\nbenchmarking and analyzing the classical two-stage SGG methods and the most\nrecent transformer-based SGG methods in continual learning settings, and gain\nvaluable insights into the CSEGG problem. 
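The echocardiography-synthesis entry above injects semantic guidance by applying spatially adaptive normalization to multiscale feature maps. A minimal SPADE-style sketch of such a block follows; the channel sizes, hidden width, and the exact normalization layer are illustrative simplifications rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveNorm(nn.Module):
    """SPADE-style block: features are normalized without learned affine
    parameters, then modulated by per-pixel scale and shift predicted from a
    semantic map resized to the feature resolution."""
    def __init__(self, feat_channels, label_channels, hidden=64):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1),
                                    nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, segmap):
        segmap = F.interpolate(segmap, size=feat.shape[-2:], mode='nearest')
        h = self.shared(segmap)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)

feat = torch.randn(2, 128, 32, 32)        # a decoder feature map
segmap = torch.randn(2, 4, 256, 256)      # semantic map with 4 label channels
out = SpatiallyAdaptiveNorm(128, 4)(feat, segmap)
print(out.shape)                          # torch.Size([2, 128, 32, 32])
```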
We invite the research community to\nexplore this emerging field of study.\n","authors":["Naitik Khandelwal","Xiao Liu","Mengmi Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.01636v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.15142v4","updated":"2023-10-11T01:33:21Z","published":"2023-06-27T02:03:46Z","title":"LRANet: Towards Accurate and Efficient Scene Text Detection with\n Low-Rank Approximation Network","summary":" Recently, regression-based methods, which predict parameterized text shapes\nfor text localization, have gained popularity in scene text detection. However,\nthe existing parameterized text shape methods still have limitations in\nmodeling arbitrary-shaped texts due to ignoring the utilization of\ntext-specific shape information. Moreover, the time consumption of the entire\npipeline has been largely overlooked, leading to a suboptimal overall inference\nspeed. To address these issues, we first propose a novel parameterized text\nshape method based on low-rank approximation. Unlike other shape representation\nmethods that employ data-irrelevant parameterization, our approach utilizes\nsingular value decomposition and reconstructs the text shape using a few\neigenvectors learned from labeled text contours. By exploring the shape\ncorrelation among different text contours, our method achieves consistency,\ncompactness, simplicity, and robustness in shape representation. Next, we\npropose a dual assignment scheme for speed acceleration. It adopts a sparse\nassignment branch to accelerate the inference speed, and meanwhile, provides\nample supervised signals for training through a dense assignment branch.\nBuilding upon these designs, we implement an accurate and efficient\narbitrary-shaped text detector named LRANet. Extensive experiments are\nconducted on several challenging benchmarks, demonstrating the superior\naccuracy and efficiency of LRANet compared to state-of-the-art methods. Code\nwill be released soon.\n","authors":["Yuchen Su","Zhineng Chen","Zhiwen Shao","Yuning Du","Zhilong Ji","Jinfeng Bai","Yong Zhou","Yu-Gang Jiang"],"pdf_url":"https://arxiv.org/pdf/2306.15142v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.14349v3","updated":"2023-10-11T01:30:37Z","published":"2021-06-28T00:34:15Z","title":"PNet -- A Deep Learning Based Photometry and Astrometry Bayesian\n Framework","summary":" Time domain astronomy has emerged as a vibrant research field in recent\nyears, focusing on celestial objects that exhibit variable magnitudes or\npositions. Given the urgency of conducting follow-up observations for such\nobjects, the development of an algorithm capable of detecting them and\ndetermining their magnitudes and positions has become imperative. Leveraging\nthe advancements in deep neural networks, we present the PNet, an end-to-end\nframework designed not only to detect celestial objects and extract their\nmagnitudes and positions but also to estimate photometry uncertainty. The PNet\ncomprises two essential steps. Firstly, it detects stars and retrieves their\npositions, magnitudes, and calibrated magnitudes. Subsequently, in the second\nphase, the PNet estimates the uncertainty associated with the photometry\nresults, serving as a valuable reference for the light curve classification\nalgorithm. Our algorithm has been tested using both simulated and real\nobservation data, demonstrating the PNet's ability to deliver consistent and\nreliable outcomes. 
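The LRANet entry above parameterizes text shapes by low-rank approximation: SVD on labeled contours yields a few shape eigenvectors, and each contour is then encoded as a handful of coefficients. A small numpy sketch of that representation idea on synthetic contours (the point counts, rank, and data here are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "labeled contours": N contours, each with K 2-D points, flattened to 2K values.
N, K, r = 200, 14, 4
base = np.stack([np.cos(np.linspace(0, 2 * np.pi, K)),
                 np.sin(np.linspace(0, 2 * np.pi, K))], axis=1).reshape(-1)
contours = base + 0.1 * rng.standard_normal((N, 2 * K))

# Learn a low-rank shape basis from the training contours via SVD.
mean = contours.mean(axis=0)
_, _, Vt = np.linalg.svd(contours - mean, full_matrices=False)
basis = Vt[:r]                                  # r shape eigenvectors

# A new contour is represented by r coefficients and reconstructed from them.
new_contour = base + 0.1 * rng.standard_normal(2 * K)
coeffs = (new_contour - mean) @ basis.T         # compact r-dimensional parameterization
recon = mean + coeffs @ basis
print(coeffs.shape, np.abs(recon - new_contour).max())
```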
Integration of the PNet into data processing pipelines for\ntime-domain astronomy holds significant potential for enhancing response speed\nand improving the detection capabilities for celestial objects with variable\npositions and magnitudes.\n","authors":["Rui Sun","Peng Jia","Yongyang Sun","Zhimin Yang","Qiang Liu","Hongyan Wei"],"pdf_url":"https://arxiv.org/pdf/2106.14349v3.pdf","comment":"To be published in the AJ and welcome to any comments"},{"id":"http://arxiv.org/abs/2310.04780v2","updated":"2023-10-11T00:38:50Z","published":"2023-10-07T11:45:33Z","title":"IPMix: Label-Preserving Data Augmentation Method for Training Robust\n Classifiers","summary":" Data augmentation has been proven effective for training high-accuracy\nconvolutional neural network classifiers by preventing overfitting. However,\nbuilding deep neural networks in real-world scenarios requires not only high\naccuracy on clean data but also robustness when data distributions shift. While\nprior methods have proposed that there is a trade-off between accuracy and\nrobustness, we propose IPMix, a simple data augmentation approach to improve\nrobustness without hurting clean accuracy. IPMix integrates three levels of\ndata augmentation (image-level, patch-level, and pixel-level) into a coherent\nand label-preserving technique to increase the diversity of training data with\nlimited computational overhead. To further improve the robustness, IPMix\nintroduces structural complexity at different levels to generate more diverse\nimages and adopts the random mixing method for multi-scale information fusion.\nExperiments demonstrate that IPMix outperforms state-of-the-art corruption\nrobustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also\nsignificantly improves the other safety measures, including robustness to\nadversarial perturbations, calibration, prediction consistency, and anomaly\ndetection, achieving state-of-the-art or comparable results on several\nbenchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.\n","authors":["Zhenglin Huang","Xianan Bao","Na Zhang","Qingqi Zhang","Xiaomei Tu","Biao Wu","Xi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.04780v2.pdf","comment":null},{"id":"http://arxiv.org/abs/1910.11103v2","updated":"2023-10-11T00:11:45Z","published":"2019-10-16T23:30:22Z","title":"SPEC2: SPECtral SParsE CNN Accelerator on FPGAs","summary":" To accelerate inference of Convolutional Neural Networks (CNNs), various\ntechniques have been proposed to reduce computation redundancy. Converting\nconvolutional layers into frequency domain significantly reduces the\ncomputation complexity of the sliding window operations in space domain. On the\nother hand, weight pruning techniques address the redundancy in model\nparameters by converting dense convolutional kernels into sparse ones. To\nobtain high-throughput FPGA implementation, we propose SPEC2 -- the first work\nto prune and accelerate spectral CNNs. First, we propose a systematic pruning\nalgorithm based on Alternative Direction Method of Multipliers (ADMM). The\noffline pruning iteratively sets the majority of spectral weights to zero,\nwithout using any handcrafted heuristics. Then, we design an optimized pipeline\narchitecture on FPGA that has efficient random access into the sparse kernels\nand exploits various dimensions of parallelism in convolutional layers.\nOverall, SPEC2 achieves high inference throughput with extremely low\ncomputation complexity and negligible accuracy degradation. 
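The SPEC2 entry above builds on spectral CNNs, where a convolution becomes an element-wise product in the frequency domain, and pruning then zeroes most of the spectral weights. A tiny numpy check of that equivalence for circular convolution on an 8x8 tile; the ADMM pruning procedure and FPGA pipeline are well beyond this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))         # input tile
k = rng.standard_normal((8, 8))         # kernel zero-padded to the tile size

# Spectral convolution: element-wise product in the frequency domain.
spectral = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))

# Reference: direct circular convolution with nested loops.
direct = np.zeros_like(x)
for i in range(8):
    for j in range(8):
        for u in range(8):
            for v in range(8):
                direct[i, j] += x[u, v] * k[(i - u) % 8, (j - v) % 8]

print(np.max(np.abs(spectral - direct)))   # ~1e-13: the two computations agree
```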
We demonstrate\nSPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex\nplatform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy\nloss for LeNet, and <1% accuracy loss for VGG16. The resulting accelerators\nachieve up to 24x higher throughput, compared with the state-of-the-art FPGA\nimplementations for VGG16.\n","authors":["Yue Niu","Hanqing Zeng","Ajitesh Srivastava","Kartik Lakhotia","Rajgopal Kannan","Yanzhi Wang","Viktor Prasanna"],"pdf_url":"https://arxiv.org/pdf/1910.11103v2.pdf","comment":"This is a 10-page conference paper in 26TH IEEE International\n Conference On High Performance Computing, Data, and Analytics (HiPC)"},{"id":"http://arxiv.org/abs/2306.06323v2","updated":"2023-10-11T23:40:03Z","published":"2023-06-10T00:27:37Z","title":"Learning Joint Latent Space EBM Prior Model for Multi-layer Generator","summary":" This paper studies the fundamental problem of learning multi-layer generator\nmodels. The multi-layer generator model builds multiple layers of latent\nvariables as a prior model on top of the generator, which benefits learning\ncomplex data distribution and hierarchical representations. However, such a\nprior model usually focuses on modeling inter-layer relations between latent\nvariables by assuming non-informative (conditional) Gaussian distributions,\nwhich can be limited in model expressivity. To tackle this issue and learn more\nexpressive prior models, we propose an energy-based model (EBM) on the joint\nlatent space over all layers of latent variables with the multi-layer generator\nas its backbone. Such joint latent space EBM prior model captures the\nintra-layer contextual relations at each layer through layer-wise energy terms,\nand latent variables across different layers are jointly corrected. We develop\na joint training scheme via maximum likelihood estimation (MLE), which involves\nMarkov Chain Monte Carlo (MCMC) sampling for both prior and posterior\ndistributions of the latent variables from different layers. To ensure\nefficient inference and learning, we further propose a variational training\nscheme where an inference model is used to amortize the costly posterior MCMC\nsampling. Our experiments demonstrate that the learned model can be expressive\nin generating high-quality images and capturing hierarchical features for\nbetter outlier detection.\n","authors":["Jiali Cui","Ying Nian Wu","Tian Han"],"pdf_url":"https://arxiv.org/pdf/2306.06323v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03135v3","updated":"2023-10-11T23:38:36Z","published":"2023-07-06T17:05:26Z","title":"Distilling Large Vision-Language Model with Out-of-Distribution\n Generalizability","summary":" Large vision-language models have achieved outstanding performance, but their\nsize and computational requirements make their deployment on\nresource-constrained devices and time-sensitive tasks impractical. Model\ndistillation, the process of creating smaller, faster models that maintain the\nperformance of larger models, is a promising direction towards the solution.\nThis paper investigates the distillation of visual representations in large\nteacher vision-language models into lightweight student models using a small-\nor mid-scale dataset. Notably, this study focuses on open-vocabulary\nout-of-distribution (OOD) generalization, a challenging problem that has been\noverlooked in previous model distillation literature. 
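The joint latent-space EBM entry above relies on Langevin-style MCMC to sample latent variables under an energy-based prior. A minimal sketch of short-run Langevin sampling for a toy 2-D latent follows; the energy network, step size, iteration count, and the Gaussian reference term are illustrative choices, not the paper's training scheme.

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # toy E(z)

def langevin_sample(energy, n=128, steps=60, step_size=0.05):
    """Short-run Langevin dynamics targeting p(z) proportional to
    exp(-E(z)) * N(z; 0, I):
    z <- z - (s/2) * grad_z [E(z) + ||z||^2 / 2] + sqrt(s) * noise."""
    z = torch.randn(n, 2)
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        e = energy(z).sum() + 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(e, z)[0]
        z = z - 0.5 * step_size * grad + step_size ** 0.5 * torch.randn_like(z)
    return z.detach()

samples = langevin_sample(energy)
print(samples.shape, samples.mean(0))
```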
We propose two principles\nfrom vision and language modality perspectives to enhance student's OOD\ngeneralization: (1) by better imitating teacher's visual representation space,\nand carefully promoting better coherence in vision-language alignment with the\nteacher; (2) by enriching the teacher's language representations with\ninformative and finegrained semantic attributes to effectively distinguish\nbetween different labels. We propose several metrics and conduct extensive\nexperiments to investigate their techniques. The results demonstrate\nsignificant improvements in zero-shot and few-shot student performance on\nopen-vocabulary out-of-distribution classification, highlighting the\neffectiveness of our proposed approaches. Poster:\nhttps://xuanlinli17.github.io/pdfs/iccv23_large_vlm_distillation_poster.pdf\nCode: https://github.com/xuanlinli17/large_vlm_distillation_ood\n","authors":["Xuanlin Li","Yunhao Fang","Minghua Liu","Zhan Ling","Zhuowen Tu","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2307.03135v3.pdf","comment":"Published at International Conference on Computer Vision (ICCV) 2023.\n Poster at\n https://xuanlinli17.github.io/pdfs/iccv23_large_vlm_distillation_poster.pdf"},{"id":"http://arxiv.org/abs/2310.07932v1","updated":"2023-10-11T23:04:07Z","published":"2023-10-11T23:04:07Z","title":"What Matters to You? Towards Visual Representation Alignment for Robot\n Learning","summary":" When operating in service of people, robots need to optimize rewards aligned\nwith end-user preferences. Since robots will rely on raw perceptual inputs like\nRGB images, their rewards will inevitably use visual representations. Recently\nthere has been excitement in using representations from pre-trained visual\nmodels, but key to making these work in robotics is fine-tuning, which is\ntypically done via proxy tasks like dynamics prediction or enforcing temporal\ncycle-consistency. However, all these proxy tasks bypass the human's input on\nwhat matters to them, exacerbating spurious correlations and ultimately leading\nto robot behaviors that are misaligned with user preferences. In this work, we\npropose that robots should leverage human feedback to align their visual\nrepresentations with the end-user and disentangle what matters for the task. We\npropose Representation-Aligned Preference-based Learning (RAPL), a method for\nsolving the visual representation alignment problem and visual reward learning\nproblem through the lens of preference-based learning and optimal transport.\nAcross experiments in X-MAGICAL and in robotic manipulation, we find that\nRAPL's reward consistently generates preferred robot behaviors with high sample\nefficiency, and shows strong zero-shot generalization when the visual\nrepresentation is learned from a different embodiment than the robot's.\n","authors":["Ran Tian","Chenfeng Xu","Masayoshi Tomizuka","Jitendra Malik","Andrea Bajcsy"],"pdf_url":"https://arxiv.org/pdf/2310.07932v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07931v1","updated":"2023-10-11T23:01:29Z","published":"2023-10-11T23:01:29Z","title":"D2 Pruning: Message Passing for Balancing Diversity and Difficulty in\n Data Pruning","summary":" Analytical theories suggest that higher-quality data can lead to lower test\nerrors in models trained on a fixed data budget. Moreover, a model can be\ntrained on a lower compute budget without compromising performance if a dataset\ncan be stripped of its redundancies. 
Coreset selection (or data pruning) seeks\nto select a subset of the training data so as to maximize the performance of\nmodels trained on this subset, also referred to as coreset. There are two\ndominant approaches: (1) geometry-based data selection for maximizing data\ndiversity in the coreset, and (2) functions that assign difficulty scores to\nsamples based on training dynamics. Optimizing for data diversity leads to a\ncoreset that is biased towards easier samples, whereas, selection by difficulty\nranking omits easy samples that are necessary for the training of deep learning\nmodels. This demonstrates that data diversity and importance scores are two\ncomplementary factors that need to be jointly considered during coreset\nselection. We represent a dataset as an undirected graph and propose a novel\npruning algorithm, D2 Pruning, that uses forward and reverse message passing\nover this dataset graph for coreset selection. D2 Pruning updates the\ndifficulty scores of each example by incorporating the difficulty of its\nneighboring examples in the dataset graph. Then, these updated difficulty\nscores direct a graph-based sampling method to select a coreset that\nencapsulates both diverse and difficult regions of the dataset space. We\nevaluate supervised and self-supervised versions of our method on various\nvision and language datasets. Results show that D2 Pruning improves coreset\nselection over previous state-of-the-art methods for up to 70% pruning rates.\nAdditionally, we find that using D2 Pruning for filtering large multimodal\ndatasets leads to increased diversity in the dataset and improved\ngeneralization of pretrained models.\n","authors":["Adyasha Maharana","Prateek Yadav","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2310.07931v1.pdf","comment":"17 pages (Our code is available at\n https://github.com/adymaharana/d2pruning)"},{"id":"http://arxiv.org/abs/2308.12364v2","updated":"2023-10-11T22:38:38Z","published":"2023-08-23T18:08:32Z","title":"Saliency-based Video Summarization for Face Anti-spoofing","summary":" With the growing availability of databases for face presentation attack\ndetection, researchers are increasingly focusing on video-based face\nanti-spoofing methods that involve hundreds to thousands of images for training\nthe models. However, there is currently no clear consensus on the optimal\nnumber of frames in a video to improve face spoofing detection. Inspired by the\nvisual saliency theory, we present a video summarization method for face\nanti-spoofing detection that aims to enhance the performance and efficiency of\ndeep learning models by leveraging visual saliency. In particular, saliency\ninformation is extracted from the differences between the Laplacian and Wiener\nfilter outputs of the source images, enabling identification of the most\nvisually salient regions within each frame. Subsequently, the source images are\ndecomposed into base and detail images, enhancing the representation of the\nmost important information. Weighting maps are then computed based on the\nsaliency information, indicating the importance of each pixel in the image. By\nlinearly combining the base and detail images using the weighting maps, the\nmethod fuses the source images to create a single representative image that\nsummarizes the entire video. 
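The D2 Pruning entry above updates each example's difficulty score with the scores of its neighbours in a dataset graph before selecting a coreset. The sketch below shows one simplified round of that kind of score propagation on a kNN graph plus a naive top-score selection; the weighting, the single forward pass, and the selection rule are simplifications of (not equal to) the paper's forward/reverse message passing and graph-based sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((500, 32))          # example embeddings
difficulty = rng.random(500)                  # per-example difficulty scores
k, gamma = 10, 0.5                            # neighbourhood size, mixing weight

# k-nearest-neighbour graph from pairwise Euclidean distances.
d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
nbrs = np.argsort(d, axis=1)[:, :k]

# One round of message passing: blend each score with its neighbours' scores,
# weighting closer neighbours more strongly.
w = np.exp(-np.take_along_axis(d, nbrs, axis=1))
updated = difficulty + gamma * (w * difficulty[nbrs]).sum(1) / w.sum(1)

# A naive selection: keep the examples with the highest updated scores.
coreset = np.argsort(-updated)[:50]
print(coreset[:10])
```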
The key contribution of the proposed method lies\nin demonstrating how visual saliency can be used as a data-centric approach to\nimprove the performance and efficiency for face presentation attack detection.\nBy focusing on the most salient images or regions within the images, a more\nrepresentative and diverse training set can be created, potentially leading to\nmore effective models. To validate the method's effectiveness, a simple CNN-RNN\ndeep learning architecture was used, and the experimental results showcased\nstate-of-the-art performance on five challenging face anti-spoofing datasets\n","authors":["Usman Muhammad","Mourad Oussalah","Jorma Laaksonen"],"pdf_url":"https://arxiv.org/pdf/2308.12364v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14616v3","updated":"2023-10-11T22:15:54Z","published":"2023-09-26T02:09:52Z","title":"NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized\n Device Coordinates Space","summary":" Monocular 3D Semantic Scene Completion (SSC) has garnered significant\nattention in recent years due to its potential to predict complex semantics and\ngeometry shapes from a single image, requiring no 3D inputs. In this paper, we\nidentify several critical issues in current state-of-the-art methods, including\nthe Feature Ambiguity of projected 2D features in the ray to the 3D space, the\nPose Ambiguity of the 3D convolution, and the Computation Imbalance in the 3D\nconvolution across different depth levels. To address these problems, we devise\na novel Normalized Device Coordinates scene completion network (NDC-Scene) that\ndirectly extends the 2D feature map to a Normalized Device Coordinates (NDC)\nspace, rather than to the world space directly, through progressive restoration\nof the dimension of depth with deconvolution operations. Experiment results\ndemonstrate that transferring the majority of computation from the target 3D\nspace to the proposed normalized device coordinates space benefits monocular\nSSC tasks. Additionally, we design a Depth-Adaptive Dual Decoder to\nsimultaneously upsample and fuse the 2D and 3D feature maps, further improving\noverall performance. Our extensive experiments confirm that the proposed method\nconsistently outperforms state-of-the-art methods on both outdoor SemanticKITTI\nand indoor NYUv2 datasets. Our code are available at\nhttps://github.com/Jiawei-Yao0812/NDCScene.\n","authors":["Jiawei Yao","Chuming Li","Keqiang Sun","Yingjie Cai","Hao Li","Wanli Ouyang","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2309.14616v3.pdf","comment":"Accepted at ICCV 2023. Project page:\n https://jiawei-yao0812.github.io/NDC-Scene/"},{"id":"http://arxiv.org/abs/2310.07916v1","updated":"2023-10-11T22:04:33Z","published":"2023-10-11T22:04:33Z","title":"Dynamic Appearance Particle Neural Radiance Field","summary":" Neural Radiance Fields (NeRFs) have shown great potential in modelling 3D\nscenes. Dynamic NeRFs extend this model by capturing time-varying elements,\ntypically using deformation fields. The existing dynamic NeRFs employ a similar\nEulerian representation for both light radiance and deformation fields. This\nleads to a close coupling of appearance and motion and lacks a physical\ninterpretation. In this work, we propose Dynamic Appearance Particle Neural\nRadiance Field (DAP-NeRF), which introduces particle-based representation to\nmodel the motions of visual elements in a dynamic 3D scene. DAP-NeRF consists\nof superposition of a static field and a dynamic field. 
The dynamic field is\nquantised as a collection of {\\em appearance particles}, which carries the\nvisual information of a small dynamic element in the scene and is equipped with\na motion model. All components, including the static field, the visual features\nand motion models of the particles, are learned from monocular videos without\nany prior geometric knowledge of the scene. We develop an efficient\ncomputational framework for the particle-based model. We also construct a new\ndataset to evaluate motion modelling. Experimental results show that DAP-NeRF\nis an effective technique to capture not only the appearance but also the\nphysically meaningful motions in a 3D dynamic scene.\n","authors":["Ancheng Lin","Jun Li"],"pdf_url":"https://arxiv.org/pdf/2310.07916v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07896v1","updated":"2023-10-11T21:07:14Z","published":"2023-10-11T21:07:14Z","title":"NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration","summary":" Robotic learning for navigation in unfamiliar environments needs to provide\npolicies for both task-oriented navigation (i.e., reaching a goal that the\nrobot has located), and task-agnostic exploration (i.e., searching for a goal\nin a novel setting). Typically, these roles are handled by separate models, for\nexample by using subgoal proposals, planning, or separate navigation\nstrategies. In this paper, we describe how we can train a single unified\ndiffusion policy to handle both goal-directed navigation and goal-agnostic\nexploration, with the latter providing the ability to search novel\nenvironments, and the former providing the ability to reach a user-specified\ngoal once it has been located. We show that this unified policy results in\nbetter overall performance when navigating to visually indicated goals in novel\nenvironments, as compared to approaches that use subgoal proposals from\ngenerative models, or prior methods based on latent variable models. We\ninstantiate our method by using a large-scale Transformer-based policy trained\non data from multiple ground robots, with a diffusion model decoder to flexibly\nhandle both goal-conditioned and goal-agnostic navigation. Our experiments,\nconducted on a real-world mobile robot platform, show effective navigation in\nunseen environments in comparison with five alternative methods, and\ndemonstrate significant improvements in performance and lower collision rates,\ndespite utilizing smaller models than state-of-the-art approaches. For more\nvideos, code, and pre-trained model checkpoints, see\nhttps://general-navigation-models.github.io/nomad/\n","authors":["Ajay Sridhar","Dhruv Shah","Catherine Glossop","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2310.07896v1.pdf","comment":"Project page https://general-navigation-models.github.io/nomad/"},{"id":"http://arxiv.org/abs/2310.07894v1","updated":"2023-10-11T21:04:42Z","published":"2023-10-11T21:04:42Z","title":"Efficient Integrators for Diffusion Generative Models","summary":" Diffusion models suffer from slow sample generation at inference time.\nTherefore, developing a principled framework for fast deterministic/stochastic\nsampling for a broader class of diffusion models is a promising direction. We\npropose two complementary frameworks for accelerating sample generation in\npre-trained models: Conjugate Integrators and Splitting Integrators. Conjugate\nintegrators generalize DDIM, mapping the reverse diffusion dynamics to a more\namenable space for sampling. 
In contrast, splitting-based integrators, commonly\nused in molecular dynamics, reduce the numerical simulation error by cleverly\nalternating between numerical updates involving the data and auxiliary\nvariables. After extensively studying these methods empirically and\ntheoretically, we present a hybrid method that leads to the best-reported\nperformance for diffusion models in augmented spaces. Applied to Phase Space\nLangevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our deterministic and\nstochastic samplers achieve FID scores of 2.11 and 2.36 in only 100 network\nfunction evaluations (NFE) as compared to 2.57 and 2.63 for the best-performing\nbaselines, respectively. Our code and model checkpoints will be made publicly\navailable at \\url{https://github.com/mandt-lab/PSLD}.\n","authors":["Kushagra Pandey","Maja Rudolph","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2310.07894v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07889v1","updated":"2023-10-11T20:52:30Z","published":"2023-10-11T20:52:30Z","title":"LangNav: Language as a Perceptual Representation for Navigation","summary":" We explore the use of language as a perceptual representation for\nvision-and-language navigation. Our approach uses off-the-shelf vision systems\n(for image captioning and object detection) to convert an agent's egocentric\npanoramic view at each time step into natural language descriptions. We then\nfinetune a pretrained language model to select an action, based on the current\nview and the trajectory history, that would best fulfill the navigation\ninstructions. In contrast to the standard setup which adapts a pretrained\nlanguage model to work directly with continuous visual features from pretrained\nvision models, our approach instead uses (discrete) language as the perceptual\nrepresentation. We explore two use cases of our language-based navigation\n(LangNav) approach on the R2R vision-and-language navigation benchmark:\ngenerating synthetic trajectories from a prompted large language model (GPT-4)\nwith which to finetune a smaller language model; and sim-to-real transfer where\nwe transfer a policy learned on a simulated environment (ALFRED) to a\nreal-world environment (R2R). Our approach is found to improve upon strong\nbaselines that rely on visual features in settings where only a few gold\ntrajectories (10-100) are available, demonstrating the potential of using\nlanguage as a perceptual representation for navigation tasks.\n","authors":["Bowen Pan","Rameswar Panda","SouYoung Jin","Rogerio Feris","Aude Oliva","Phillip Isola","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2310.07889v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07887v1","updated":"2023-10-11T20:48:20Z","published":"2023-10-11T20:48:20Z","title":"Unsupervised Structured Noise Removal with Variational Lossy Autoencoder","summary":" Most unsupervised denoising methods are based on the assumption that imaging\nnoise is either pixel-independent, i.e., spatially uncorrelated, or\nsignal-independent, i.e., purely additive. However, in practice many imaging\nsetups, especially in microscopy, suffer from a combination of signal-dependent\nnoise (e.g. Poisson shot noise) and axis-aligned correlated noise (e.g. stripe\nshaped scanning or readout artifacts). In this paper, we present the first\nunsupervised deep learning-based denoiser that can remove this type of noise\nwithout access to any clean images or a noise model. 
Unlike self-supervised\ntechniques, our method does not rely on removing pixels by masking or\nsubsampling so can utilize all available information. We implement a\nVariational Autoencoder (VAE) with a specially designed autoregressive decoder\ncapable of modelling the noise component of an image but incapable of\nindependently modelling the underlying clean signal component. As a\nconsequence, our VAE's encoder learns to encode only underlying clean signal\ncontent and to discard imaging noise. We also propose an additional decoder for\nmapping the encoder's latent variables back into image space, thereby sampling\ndenoised images. Experimental results demonstrate that our approach surpasses\nexisting methods for self- and unsupervised image denoising while being robust\nwith respect to the size of the autoregressive receptive field. Code for this\nproject can be found at https://github.com/krulllab/DVLAE.\n","authors":["Benjamin Salmon","Alexander Krull"],"pdf_url":"https://arxiv.org/pdf/2310.07887v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07886v1","updated":"2023-10-11T20:48:19Z","published":"2023-10-11T20:48:19Z","title":"A Survey of Feature Types and Their Contributions for Camera Tampering\n Detection","summary":" Camera tamper detection is the ability to detect unauthorized and\nunintentional alterations in surveillance cameras by analyzing the video.\nCamera tampering can occur due to natural events or it can be caused\nintentionally to disrupt surveillance. We cast tampering detection as a change\ndetection problem, and perform a review of the existing literature with\nemphasis on feature types. We formulate tampering detection as a time series\nanalysis problem, and design experiments to study the robustness and capability\nof various feature types. We compute ten features on real-world surveillance\nvideo and apply time series analysis to ascertain their predictability, and\ntheir capability to detect tampering. Finally, we quantify the performance of\nvarious time series models using each feature type to detect tampering.\n","authors":["Pranav Mantini","Shishir K. Shah"],"pdf_url":"https://arxiv.org/pdf/2310.07886v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16319v2","updated":"2023-10-11T20:33:37Z","published":"2023-05-25T17:59:50Z","title":"Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical\n Invariance","summary":" This paper introduces a novel mathematical property applicable to diverse\nimages, referred to as FINOLA (First-Order Norm+Linear Autoregressive). FINOLA\nrepresents each image in the latent space as a first-order autoregressive\nprocess, in which each regression step simply applies a shared linear model on\nthe normalized value of its immediate neighbor. This intriguing property\nreveals a mathematical invariance that transcends individual images. Expanding\nfrom image grids to continuous coordinates, we unveil the presence of two\nunderlying partial differential equations. We validate the FINOLA property from\ntwo distinct angles: image reconstruction and self-supervised learning.\nFirstly, we demonstrate the ability of FINOLA to auto-regress up to a 256x256\nfeature map (the same resolution to the image) from a single vector placed at\nthe center, successfully reconstructing the original image by only using three\n3x3 convolution layers as decoder. 
Secondly, we leverage FINOLA for\nself-supervised learning by employing a simple masked prediction approach.\nEncoding a single unmasked quadrant block, we autoregressively predict the\nsurrounding masked region. Remarkably, this pre-trained representation proves\nhighly effective in image classification and object detection tasks, even when\nintegrated into lightweight networks, all without the need for extensive\nfine-tuning. The code will be made publicly available.\n","authors":["Yinpeng Chen","Xiyang Dai","Dongdong Chen","Mengchen Liu","Lu Yuan","Zicheng Liu","Youzuo Lin"],"pdf_url":"https://arxiv.org/pdf/2305.16319v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.01748v2","updated":"2023-10-11T20:28:58Z","published":"2023-03-03T07:20:58Z","title":"A Complete Recipe for Diffusion Generative Models","summary":" Score-based Generative Models (SGMs) have demonstrated exceptional synthesis\noutcomes across various tasks. However, the current design landscape of the\nforward diffusion process remains largely untapped and often relies on physical\nheuristics or simplifying assumptions. Utilizing insights from the development\nof scalable Bayesian posterior samplers, we present a complete recipe for\nformulating forward processes in SGMs, ensuring convergence to the desired\ntarget distribution. Our approach reveals that several existing SGMs can be\nseen as specific manifestations of our framework. Building upon this method, we\nintroduce Phase Space Langevin Diffusion (PSLD), which relies on score-based\nmodeling within an augmented space enriched by auxiliary variables akin to\nphysical phase space. Empirical results exhibit the superior sample quality and\nimproved speed-quality trade-off of PSLD compared to various competing\napproaches on established image synthesis benchmarks. Remarkably, PSLD achieves\nsample quality akin to state-of-the-art SGMs (FID: 2.10 for unconditional\nCIFAR-10 generation). Lastly, we demonstrate the applicability of PSLD in\nconditional synthesis using pre-trained score networks, offering an appealing\nalternative as an SGM backbone for future advancements. Code and model\ncheckpoints can be accessed at \\url{https://github.com/mandt-lab/PSLD}.\n","authors":["Kushagra Pandey","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2303.01748v2.pdf","comment":"Accepted in ICCV'23 (Oral Presentation)"},{"id":"http://arxiv.org/abs/2210.09222v2","updated":"2023-10-11T19:59:02Z","published":"2022-10-14T08:05:16Z","title":"MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human\n Activity Recognition","summary":" Multimodal sensors provide complementary information to develop accurate\nmachine-learning methods for human activity recognition (HAR), but introduce\nsignificantly higher computational load, which reduces efficiency. This paper\nproposes an efficient multimodal neural architecture for HAR using an RGB\ncamera and inertial measurement units (IMUs) called Multimodal Temporal Segment\nAttention Network (MMTSA). MMTSA first transforms IMU sensor data into a\ntemporal and structure-preserving gray-scale image using the Gramian Angular\nField (GAF), representing the inherent properties of human activities. MMTSA\nthen applies a multimodal sparse sampling method to reduce data redundancy.\nLastly, MMTSA adopts an inter-segment attention module for efficient multimodal\nfusion. Using three well-established public datasets, we evaluated MMTSA's\neffectiveness and efficiency in HAR. 
Results show that our method achieves\nsuperior performance improvements 11.13% of cross-subject F1-score on the MMAct\ndataset than the previous state-of-the-art (SOTA) methods. The ablation study\nand analysis suggest that MMTSA's effectiveness in fusing multimodal data for\naccurate HAR. The efficiency evaluation on an edge device showed that MMTSA\nachieved significantly better accuracy, lower computational load, and lower\ninference latency than SOTA methods.\n","authors":["Ziqi Gao","Yuntao Wang","Jianguo Chen","Junliang Xing","Shwetak Patel","Xin Liu","Yuanchun Shi"],"pdf_url":"https://arxiv.org/pdf/2210.09222v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07855v1","updated":"2023-10-11T19:57:51Z","published":"2023-10-11T19:57:51Z","title":"CrIBo: Self-Supervised Learning via Cross-Image Object-Level\n Bootstrapping","summary":" Leveraging nearest neighbor retrieval for self-supervised representation\nlearning has proven beneficial with object-centric images. However, this\napproach faces limitations when applied to scene-centric datasets, where\nmultiple objects within an image are only implicitly captured in the global\nrepresentation. Such global bootstrapping can lead to undesirable entanglement\nof object representations. Furthermore, even object-centric datasets stand to\nbenefit from a finer-grained bootstrapping approach. In response to these\nchallenges, we introduce a novel Cross-Image Object-Level Bootstrapping method\ntailored to enhance dense visual representation learning. By employing\nobject-level nearest neighbor bootstrapping throughout the training, CrIBo\nemerges as a notably strong and adequate candidate for in-context learning,\nleveraging nearest neighbor retrieval at test time. CrIBo shows\nstate-of-the-art performance on the latter task while being highly competitive\nin more standard downstream segmentation tasks. Our code and pretrained models\nwill be publicly available upon acceptance.\n","authors":["Tim Lebailly","Thomas Stegmüller","Behzad Bozorgtabar","Jean-Philippe Thiran","Tinne Tuytelaars"],"pdf_url":"https://arxiv.org/pdf/2310.07855v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.07713v1","updated":"2023-10-11T17:59:05Z","published":"2023-10-11T17:59:05Z","title":"InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining","summary":" Pretraining auto-regressive large language models (LLMs) with retrieval\ndemonstrates better perplexity and factual accuracy by leveraging external\ndatabases. However, the size of existing pretrained retrieval-augmented LLM is\nstill limited (e.g., Retro has 7.5B parameters), which limits the effectiveness\nof instruction tuning and zero-shot generalization. In this work, we introduce\nRetro 48B, the largest LLM pretrained with retrieval before instruction tuning.\nSpecifically, we continue to pretrain the 43B GPT model on additional 100\nbillion tokens using the Retro augmentation method by retrieving from 1.2\ntrillion tokens. The obtained foundation model, Retro 48B, largely outperforms\nthe original 43B GPT in terms of perplexity. After instruction tuning on Retro,\nInstructRetro demonstrates significant improvement over the instruction tuned\nGPT on zero-shot question answering (QA) tasks. 
Specifically, the average\nimprovement of InstructRetro is 7% over its GPT counterpart across 8 short-form\nQA tasks, and 10% over GPT across 4 challenging long-form QA tasks.\nSurprisingly, we find that one can ablate the encoder from InstructRetro\narchitecture and directly use its decoder backbone, while achieving comparable\nresults. We hypothesize that pretraining with retrieval makes its decoder good\nat incorporating context for QA. Our results highlights the promising direction\nto obtain a better GPT decoder for QA through continued pretraining with\nretrieval before instruction tuning.\n","authors":["Boxin Wang","Wei Ping","Lawrence McAfee","Peng Xu","Bo Li","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2310.07713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.08703v3","updated":"2023-10-11T17:00:34Z","published":"2023-05-15T15:06:20Z","title":"Schema-adaptable Knowledge Graph Construction","summary":" Conventional Knowledge Graph Construction (KGC) approaches typically follow\nthe static information extraction paradigm with a closed set of pre-defined\nschema. As a result, such approaches fall short when applied to dynamic\nscenarios or domains, whereas a new type of knowledge emerges. This\nnecessitates a system that can handle evolving schema automatically to extract\ninformation for KGC. To address this need, we propose a new task called\nschema-adaptable KGC, which aims to continually extract entity, relation, and\nevent based on a dynamically changing schema graph without re-training. We\nfirst split and convert existing datasets based on three principles to build a\nbenchmark, i.e., horizontal schema expansion, vertical schema expansion, and\nhybrid schema expansion; then investigate the schema-adaptable performance of\nseveral well-known approaches such as Text2Event, TANL, UIE and GPT-3.5. We\nfurther propose a simple yet effective baseline dubbed \\textsc{AdaKGC}, which\ncontains schema-enriched prefix instructor and schema-conditioned dynamic\ndecoding to better handle evolving schema. Comprehensive experimental results\nillustrate that AdaKGC can outperform baselines but still have room for\nimprovement. We hope the proposed work can deliver benefits to the community.\nCode and datasets available at https://github.com/zjunlp/AdaKGC.\n","authors":["Hongbin Ye","Honghao Gui","Xin Xu","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.08703v3.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2305.13172v2","updated":"2023-10-11T16:51:50Z","published":"2023-05-22T16:00:00Z","title":"Editing Large Language Models: Problems, Methods, and Opportunities","summary":" Despite the ability to train capable LLMs, the methodology for maintaining\ntheir relevancy and rectifying errors remains elusive. To this end, the past\nfew years have witnessed a surge in techniques for editing LLMs, the objective\nof which is to efficiently alter the behavior of LLMs within a specific domain\nwithout negatively impacting performance across other inputs. This paper\nembarks on a deep exploration of the problems, methods, and opportunities\nrelated to model editing for LLMs. In particular, we provide an exhaustive\noverview of the task definition and challenges associated with model editing,\nalong with an in-depth empirical analysis of the most progressive methods\ncurrently at our disposal. 
We also build a new benchmark dataset to facilitate\na more robust evaluation and pinpoint enduring issues intrinsic to existing\ntechniques. Our objective is to provide valuable insights into the\neffectiveness and feasibility of each editing technique, thereby assisting the\ncommunity in making informed decisions on the selection of the most appropriate\nmethod for a specific task or context. Code and datasets are available at\nhttps://github.com/zjunlp/EasyEdit.\n","authors":["Yunzhi Yao","Peng Wang","Bozhong Tian","Siyuan Cheng","Zhoubo Li","Shumin Deng","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.13172v2.pdf","comment":"EMNLP 2023. Updated with new experiments"},{"id":"http://arxiv.org/abs/2211.15743v3","updated":"2023-10-11T16:18:59Z","published":"2022-11-28T19:49:02Z","title":"Towards Reliable Item Sampling for Recommendation Evaluation","summary":" Since Rendle and Krichene argued that commonly used sampling-based evaluation\nmetrics are \"inconsistent\" with respect to the global metrics (even in\nexpectation), there have been a few studies on the sampling-based recommender\nsystem evaluation. Existing methods try either mapping the sampling-based\nmetrics to their global counterparts or more generally, learning the empirical\nrank distribution to estimate the top-$K$ metrics. However, despite existing\nefforts, there is still a lack of rigorous theoretical understanding of the\nproposed metric estimators, and the basic item sampling also suffers from the\n\"blind spot\" issue, i.e., estimation accuracy to recover the top-$K$ metrics\nwhen $K$ is small can still be rather substantial. In this paper, we provide an\nin-depth investigation into these problems and make two innovative\ncontributions. First, we propose a new item-sampling estimator that explicitly\noptimizes the error with respect to the ground truth, and theoretically\nhighlight its subtle difference against prior work. Second, we propose a new\nadaptive sampling method which aims to deal with the \"blind spot\" problem and\nalso demonstrate the expectation-maximization (EM) algorithm can be generalized\nfor such a setting. Our experimental results confirm our statistical analysis\nand the superiority of the proposed works. This study helps lay the theoretical\nfoundation for adopting item sampling metrics for recommendation evaluation,\nand provides strong evidence towards making item sampling a powerful and\nreliable tool for recommendation evaluation.\n","authors":["Dong Li","Ruoming Jin","Zhenming Liu","Bin Ren","Jing Gao","Zhi Liu"],"pdf_url":"https://arxiv.org/pdf/2211.15743v3.pdf","comment":"aaai2023"},{"id":"http://arxiv.org/abs/2310.07554v1","updated":"2023-10-11T14:59:53Z","published":"2023-10-11T14:59:53Z","title":"Retrieve Anything To Augment Large Language Models","summary":" Large language models (LLMs) face significant challenges stemming from the\ninherent limitations in knowledge, memory, alignment, and action. These\nchallenges cannot be addressed by LLMs alone, but should rely on assistance\nfrom the external world, such as knowledge base, memory store, demonstration\nexamples, and tools. Retrieval augmentation stands as a vital mechanism for\nbridging the gap between LLMs and the external assistance. However,\nconventional methods encounter two pressing issues. On one hand, the\ngeneral-purpose retrievers are not properly optimized for the retrieval\naugmentation of LLMs. 
On the other hand, the task-specific retrievers lack the\nrequired versatility, hindering their performance across the diverse retrieval\naugmentation scenarios.\n In this work, we present a novel approach, the LLM Embedder, which\ncomprehensively support the diverse needs of LLMs' retrieval augmentation with\none unified embedding model. Training such an unified model is non-trivial, as\nvarious retrieval tasks aim to capture distinct semantic relationships, often\nsubject to mutual interference. To address this challenge, we systematically\noptimize our training methodology. This includes reward formulation based on\nLLMs' feedback, the stabilization of knowledge distillation, multi-task\nfine-tuning with explicit instructions, and the use of homogeneous in-batch\nnegative sampling. These optimization strategies contribute to the outstanding\nempirical performance of the LLM-Embedder. Notably, it yields remarkable\nenhancements in retrieval augmentation for LLMs, surpassing both\ngeneral-purpose and task-specific retrievers in various evaluation scenarios.\nThis project is made publicly available at\nhttps://github.com/FlagOpen/FlagEmbedding.\n","authors":["Peitian Zhang","Shitao Xiao","Zheng Liu","Zhicheng Dou","Jian-Yun Nie"],"pdf_url":"https://arxiv.org/pdf/2310.07554v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09447v3","updated":"2023-10-11T13:46:25Z","published":"2023-07-18T17:22:19Z","title":"Deep Neural Aggregation for Recommending Items to Group of Users","summary":" Modern society devotes a significant amount of time to digital interaction.\nMany of our daily actions are carried out through digital means. This has led\nto the emergence of numerous Artificial Intelligence tools that assist us in\nvarious aspects of our lives. One key tool for the digital society is\nRecommender Systems, intelligent systems that learn from our past actions to\npropose new ones that align with our interests. Some of these systems have\nspecialized in learning from the behavior of user groups to make\nrecommendations to a group of individuals who want to perform a joint task. In\nthis article, we analyze the current state of Group Recommender Systems and\npropose two new models that use emerging Deep Learning architectures.\nExperimental results demonstrate the improvement achieved by employing the\nproposed models compared to the state-of-the-art models using four different\ndatasets. The source code of the models, as well as that of all the experiments\nconducted, is available in a public repository.\n","authors":["Jorge Dueñas-Lerín","Raúl Lara-Cabrera","Fernando Ortega","Jesús Bobadilla"],"pdf_url":"https://arxiv.org/pdf/2307.09447v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07477v1","updated":"2023-10-11T13:24:38Z","published":"2023-10-11T13:24:38Z","title":"GMOCAT: A Graph-Enhanced Multi-Objective Method for Computerized\n Adaptive Testing","summary":" Computerized Adaptive Testing(CAT) refers to an online system that adaptively\nselects the best-suited question for students with various abilities based on\ntheir historical response records. Most CAT methods only focus on the quality\nobjective of predicting the student ability accurately, but neglect concept\ndiversity or question exposure control, which are important considerations in\nensuring the performance and validity of CAT. Besides, the students' response\nrecords contain valuable relational information between questions and knowledge\nconcepts. 
The previous methods ignore this relational information, resulting in\nthe selection of sub-optimal test questions. To address these challenges, we\npropose a Graph-Enhanced Multi-Objective method for CAT (GMOCAT). Firstly,\nthree objectives, namely quality, diversity and novelty, are introduced into\nthe Scalarized Multi-Objective Reinforcement Learning framework of CAT, which\nrespectively correspond to improving the prediction accuracy, increasing the\nconcept diversity and reducing the question exposure. We use an Actor-Critic\nRecommender to select questions and optimize three objectives simultaneously by\nthe scalarization function. Secondly, we utilize the graph neural network to\nlearn relation-aware embeddings of questions and concepts. These embeddings are\nable to aggregate neighborhood information in the relation graphs between\nquestions and concepts. We conduct experiments on three real-world educational\ndatasets, and show that GMOCAT not only outperforms the state-of-the-art\nmethods in the ability prediction, but also achieve superior performance in\nimproving the concept diversity and alleviating the question exposure. Our code\nis available at https://github.com/justarter/GMOCAT.\n","authors":["Hangyu Wang","Ting Long","Liang Yin","Weinan Zhang","Wei Xia","Qichen Hong","Dingyin Xia","Ruiming Tang","Yong Yu"],"pdf_url":"https://arxiv.org/pdf/2310.07477v1.pdf","comment":"KDD23"},{"id":"http://arxiv.org/abs/2303.09902v2","updated":"2023-10-11T12:48:30Z","published":"2023-03-17T11:39:35Z","title":"Contrastive Self-supervised Learning in Recommender Systems: A Survey","summary":" Deep learning-based recommender systems have achieved remarkable success in\nrecent years. However, these methods usually heavily rely on labeled data\n(i.e., user-item interactions), suffering from problems such as data sparsity\nand cold-start. Self-supervised learning, an emerging paradigm that extracts\ninformation from unlabeled data, provides insights into addressing these\nproblems. Specifically, contrastive self-supervised learning, due to its\nflexibility and promising performance, has attracted considerable interest and\nrecently become a dominant branch in self-supervised learning-based\nrecommendation methods. In this survey, we provide an up-to-date and\ncomprehensive review of current contrastive self-supervised learning-based\nrecommendation methods. Firstly, we propose a unified framework for these\nmethods. We then introduce a taxonomy based on the key components of the\nframework, including view generation strategy, contrastive task, and\ncontrastive objective. For each component, we provide detailed descriptions and\ndiscussions to guide the choice of the appropriate method. 
Finally, we outline\nopen issues and promising directions for future research.\n","authors":["Mengyuan Jing","Yanmin Zhu","Tianzi Zang","Ke Wang"],"pdf_url":"https://arxiv.org/pdf/2303.09902v2.pdf","comment":"Accepted by ACM Transactions on Information Systems (TOIS)"},{"id":"http://arxiv.org/abs/2305.08732v3","updated":"2023-10-11T10:51:12Z","published":"2023-05-15T15:47:09Z","title":"Knowledge Rumination for Pre-trained Language Models","summary":" Previous studies have revealed that vanilla pre-trained language models\n(PLMs) lack the capacity to handle knowledge-intensive NLP tasks alone; thus,\nseveral works have attempted to integrate external knowledge into PLMs.\nHowever, despite the promising outcome, we empirically observe that PLMs may\nhave already encoded rich knowledge in their pre-trained parameters but fail to\nfully utilize them when applying them to knowledge-intensive tasks. In this\npaper, we propose a new paradigm dubbed Knowledge Rumination to help the\npre-trained language model utilize that related latent knowledge without\nretrieving it from the external corpus. By simply adding a prompt like \"As far\nas I know\" to the PLMs, we try to review related latent knowledge and inject\nthem back into the model for knowledge consolidation. We apply the proposed\nknowledge rumination to various language models, including RoBERTa, DeBERTa,\nand GPT-3. Experimental results on six commonsense reasoning tasks and GLUE\nbenchmarks demonstrate the effectiveness of our proposed approach, which proves\nthat the knowledge stored in PLMs can be better exploited to enhance\nperformance. Code is available in\nhttps://github.com/zjunlp/knowledge-rumination.\n","authors":["Yunzhi Yao","Peng Wang","Shengyu Mao","Chuanqi Tan","Fei Huang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.08732v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07346v1","updated":"2023-10-11T09:53:43Z","published":"2023-10-11T09:53:43Z","title":"Preliminary Results of a Scientometric Analysis of the German\n Information Retrieval Community 2020-2023","summary":" The German Information Retrieval community is located in two different\nsub-fields: Information and computer science. There are no current studies that\ninvestigate these communities on a scientometric level. Available studies only\nfocus on the information scientific part of the community. We generated a data\nset of 401 recent IR-related publications extracted from six core IR\nconferences from a mainly computer scientific background. We analyze this data\nset at the institutional and researcher level. The data set is publicly\nreleased, and we also demonstrate a mapping use case.\n","authors":["Philipp Schaer","Svetlana Myshkina","Jüri Keller"],"pdf_url":"https://arxiv.org/pdf/2310.07346v1.pdf","comment":"Data available at https://github.com/irgroup/LWDA2023-IR-community"},{"id":"http://arxiv.org/abs/2206.12781v4","updated":"2023-10-11T09:03:18Z","published":"2022-06-26T03:59:41Z","title":"Efficiently Leveraging Multi-level User Intent for Session-based\n Recommendation via Atten-Mixer Network","summary":" Session-based recommendation (SBR) aims to predict the user's next action\nbased on short and dynamic sessions. Recently, there has been an increasing\ninterest in utilizing various elaborately designed graph neural networks (GNNs)\nto capture the pair-wise relationships among items, seemingly suggesting the\ndesign of more complicated models is the panacea for improving the empirical\nperformance. 
However, these models achieve relatively marginal improvements\nwith exponential growth in model complexity. In this paper, we dissect the\nclassical GNN-based SBR models and empirically find that some sophisticated GNN\npropagations are redundant, given the readout module plays a significant role\nin GNN-based models. Based on this observation, we intuitively propose to\nremove the GNN propagation part, while the readout module will take on more\nresponsibility in the model reasoning process. To this end, we propose the\nMulti-Level Attention Mixture Network (Atten-Mixer), which leverages both\nconcept-view and instance-view readouts to achieve multi-level reasoning over\nitem transitions. As simply enumerating all possible high-level concepts is\ninfeasible for large real-world recommender systems, we further incorporate\nSBR-related inductive biases, i.e., local invariance and inherent priority to\nprune the search space. Experiments on three benchmarks demonstrate the\neffectiveness and efficiency of our proposal. We also have already launched the\nproposed techniques to a large-scale e-commercial online service since April\n2021, with significant improvements of top-tier business metrics demonstrated\nin the online experiments on live traffic.\n","authors":["Peiyan Zhang","Jiayan Guo","Chaozhuo Li","Yueqi Xie","Jaeboum Kim","Yan Zhang","Xing Xie","Haohan Wang","Sunghun Kim"],"pdf_url":"https://arxiv.org/pdf/2206.12781v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.07678v2","updated":"2023-10-11T08:34:42Z","published":"2023-03-14T07:27:30Z","title":"Query2doc: Query Expansion with Large Language Models","summary":" This paper introduces a simple yet effective query expansion approach,\ndenoted as query2doc, to improve both sparse and dense retrieval systems. The\nproposed method first generates pseudo-documents by few-shot prompting large\nlanguage models (LLMs), and then expands the query with generated\npseudo-documents. LLMs are trained on web-scale text corpora and are adept at\nknowledge memorization. The pseudo-documents from LLMs often contain highly\nrelevant information that can aid in query disambiguation and guide the\nretrievers. Experimental results demonstrate that query2doc boosts the\nperformance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and\nTREC DL, without any model fine-tuning. Furthermore, our method also benefits\nstate-of-the-art dense retrievers in terms of both in-domain and out-of-domain\nresults.\n","authors":["Liang Wang","Nan Yang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2303.07678v2.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07281v1","updated":"2023-10-11T08:15:10Z","published":"2023-10-11T08:15:10Z","title":"A Completely Locale-independent Session-based Recommender System by\n Leveraging Trained Model","summary":" In this paper, we propose a solution that won the 10th prize in the KDD Cup\n2023 Challenge Task 2 (Next Product Recommendation for Underrepresented\nLanguages/Locales). Our approach involves two steps: (i) Identify candidate\nitem sets based on co-visitation, and (ii) Re-ranking the items using LightGBM\nwith locale-independent features, including session-based features and product\nsimilarity. 
The experiment demonstrated that the locale-independent model\nperformed consistently well across different test locales, and performed even\nbetter when incorporating data from other locales into the training.\n","authors":["Yu Tokutake","Chihiro Yamasaki","Yongzhi Jin","Ayuka Inoue","Kei Harada"],"pdf_url":"https://arxiv.org/pdf/2310.07281v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06282v2","updated":"2023-10-11T02:46:12Z","published":"2023-10-10T03:32:33Z","title":"MuseChat: A Conversational Music Recommendation System for Videos","summary":" We introduce MuseChat, an innovative dialog-based music recommendation\nsystem. This unique platform not only offers interactive user engagement but\nalso suggests music tailored for input videos, so that users can refine and\npersonalize their music selections. In contrast, previous systems predominantly\nemphasized content compatibility, often overlooking the nuances of users'\nindividual preferences. For example, all the datasets only provide basic\nmusic-video pairings or such pairings with textual music descriptions. To\naddress this gap, our research offers three contributions. First, we devise a\nconversation-synthesis method that simulates a two-turn interaction between a\nuser and a recommendation system, which leverages pre-trained music tags and\nartist information. In this interaction, users submit a video to the system,\nwhich then suggests a suitable music piece with a rationale. Afterwards, users\ncommunicate their musical preferences, and the system presents a refined music\nrecommendation with reasoning. Second, we introduce a multi-modal\nrecommendation engine that matches music either by aligning it with visual cues\nfrom the video or by harmonizing visual information, feedback from previously\nrecommended music, and the user's textual input. Third, we bridge music\nrepresentations and textual data with a Large Language Model(Vicuna-7B). This\nalignment equips MuseChat to deliver music recommendations and their underlying\nreasoning in a manner resembling human communication. Our evaluations show that\nMuseChat surpasses existing state-of-the-art models in music retrieval tasks\nand pioneers the integration of the recommendation process within a natural\nlanguage framework.\n","authors":["Zhikang Dong","Bin Chen","Xiulong Liu","Pawel Polak","Peng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.06282v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07142v1","updated":"2023-10-11T02:36:38Z","published":"2023-10-11T02:36:38Z","title":"Validating Synthetic Usage Data in Living Lab Environments","summary":" Evaluating retrieval performance without editorial relevance judgments is\nchallenging, but instead, user interactions can be used as relevance signals.\nLiving labs offer a way for small-scale platforms to validate information\nretrieval systems with real users. If enough user interaction data are\navailable, click models can be parameterized from historical sessions to\nevaluate systems before exposing users to experimental rankings. However,\ninteraction data are sparse in living labs, and little is studied about how\nclick models can be validated for reliable user simulations when click data are\navailable in moderate amounts.\n This work introduces an evaluation approach for validating synthetic usage\ndata generated by click models in data-sparse human-in-the-loop environments\nlike living labs. 
We ground our methodology on the click model's estimates\nabout a system ranking compared to a reference ranking for which the relative\nperformance is known. Our experiments compare different click models and their\nreliability and robustness as more session log data becomes available. In our\nsetup, simple click models can reliably determine the relative system\nperformance with already 20 logged sessions for 50 queries. In contrast, more\ncomplex click models require more session data for reliable estimates, but they\nare a better choice in simulated interleaving experiments when enough session\ndata are available. While it is easier for click models to distinguish between\nmore diverse systems, it is harder to reproduce the system ranking based on the\nsame retrieval algorithm with different interpolation weights. Our setup is\nentirely open, and we share the code to reproduce the experiments.\n","authors":["Timo Breuer","Norbert Fuhr","Philipp Schaer"],"pdf_url":"https://arxiv.org/pdf/2310.07142v1.pdf","comment":"25 pages + appendix and references, accepted JDIQ journal paper"},{"id":"http://arxiv.org/abs/2310.07137v1","updated":"2023-10-11T02:22:28Z","published":"2023-10-11T02:22:28Z","title":"AE-smnsMLC: Multi-Label Classification with Semantic Matching and\n Negative Label Sampling for Product Attribute Value Extraction","summary":" Product attribute value extraction plays an important role for many\nreal-world applications in e-Commerce such as product search and\nrecommendation. Previous methods treat it as a sequence labeling task that\nneeds more annotation for position of values in the product text. This limits\ntheir application to real-world scenario in which only attribute values are\nweakly-annotated for each product without their position. Moreover, these\nmethods only use product text (i.e., product title and description) and do not\nconsider the semantic connection between the multiple attribute values of a\ngiven product and its text, which can help attribute value extraction. In this\npaper, we reformulate this task as a multi-label classification task that can\nbe applied for real-world scenario in which only annotation of attribute values\nis available to train models (i.e., annotation of positional information of\nattribute values is not available). We propose a classification model with\nsemantic matching and negative label sampling for attribute value extraction.\nSemantic matching aims to capture semantic interactions between attribute\nvalues of a given product and its text. Negative label sampling aims to enhance\nthe model's ability of distinguishing similar values belonging to the same\nattribute. Experimental results on three subsets of a large real-world\ne-Commerce dataset demonstrate the effectiveness and superiority of our\nproposed model.\n","authors":["Zhongfen Deng","Wei-Te Chen","Lei Chen","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2310.07137v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07874v1","updated":"2023-10-11T20:34:17Z","published":"2023-10-11T20:34:17Z","title":"Refined Mechanism Design for Approximately Structured Priors via Active\n Regression","summary":" We consider the problem of a revenue-maximizing seller with a large number of\nitems $m$ for sale to $n$ strategic bidders, whose valuations are drawn\nindependently from high-dimensional, unknown prior distributions. 
It is\nwell-known that optimal and even approximately-optimal mechanisms for this\nsetting are notoriously difficult to characterize or compute, and, even when\nthey can be found, are often rife with various counter-intuitive properties. In\nthis paper, following a model introduced recently by Cai and\nDaskalakis~\\cite{cai2022recommender}, we consider the case that bidders' prior\ndistributions can be well-approximated by a topic model. We design an active\nlearning component, responsible for interacting with the bidders and outputting\nlow-dimensional approximations of their types, and a mechanism design\ncomponent, responsible for robustifying mechanisms for the low-dimensional\nmodel to work for the approximate types of the former component. On the active\nlearning front, we cast our problem in the framework of Randomized Linear\nAlgebra (RLA) for regression problems, allowing us to import several\nbreakthrough results from that line of research, and adapt them to our setting.\nOn the mechanism design front, we remove many restrictive assumptions of prior\nwork on the type of access needed to the underlying distributions and the\nassociated mechanisms. To the best of our knowledge, our work is the first to\nformulate connections between mechanism design, and RLA for active learning of\nregression problems, opening the door for further applications of randomized\nlinear algebra primitives to mechanism design.\n","authors":["Christos Boutsikas","Petros Drineas","Marios Mertzanidis","Alexandros Psomas","Paritosh Verma"],"pdf_url":"https://arxiv.org/pdf/2310.07874v1.pdf","comment":"37th Conference on Neural Information Processing Systems (NeurIPS\n 2023)"},{"id":"http://arxiv.org/abs/2310.07815v1","updated":"2023-10-11T18:56:15Z","published":"2023-10-11T18:56:15Z","title":"Language Models As Semantic Indexers","summary":" Semantic identifier (ID) is an important concept in information retrieval\nthat aims to preserve the semantics of objects such as documents and items\ninside their IDs. Previous studies typically adopt a two-stage pipeline to\nlearn semantic IDs by first procuring embeddings using off-the-shelf text\nencoders and then deriving IDs based on the embeddings. However, each step\nintroduces potential information loss and there is usually an inherent mismatch\nbetween the distribution of embeddings within the latent space produced by text\nencoders and the anticipated distribution required for semantic indexing.\nNevertheless, it is non-trivial to design a method that can learn the\ndocument's semantic representations and its hierarchical structure\nsimultaneously, given that semantic IDs are discrete and sequentially\nstructured, and the semantic supervision is deficient. In this paper, we\nintroduce LMINDEXER, a self-supervised framework to learn semantic IDs with a\ngenerative language model. We tackle the challenge of sequential discrete ID by\nintroducing a semantic indexer capable of generating neural sequential discrete\nrepresentations with progressive training and contrastive learning. In response\nto the semantic supervision deficiency, we propose to train the model with a\nself-supervised document reconstruction objective. 
The learned semantic indexer\ncan facilitate various downstream tasks, such as recommendation and retrieval.\nWe conduct experiments on three tasks including recommendation, product search,\nand document retrieval on five datasets from various domains, where LMINDEXER\noutperforms competitive baselines significantly and consistently.\n","authors":["Bowen Jin","Hansi Zeng","Guoyin Wang","Xiusi Chen","Tianxin Wei","Ruirui Li","Zhengyang Wang","Zheng Li","Yang Li","Hanqing Lu","Suhang Wang","Jiawei Han","Xianfeng Tang"],"pdf_url":"https://arxiv.org/pdf/2310.07815v1.pdf","comment":"9 pages, 3 appendix pages"},{"id":"http://arxiv.org/abs/2310.07786v1","updated":"2023-10-11T18:15:55Z","published":"2023-10-11T18:15:55Z","title":"Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble\n Sampling","summary":" Real-world applications of contextual bandits often exhibit non-stationarity\ndue to seasonality, serendipity, and evolving social trends. While a number of\nnon-stationary contextual bandit learning algorithms have been proposed in the\nliterature, they excessively explore due to a lack of prioritization for\ninformation of enduring value, or are designed in ways that do not scale in\nmodern applications with high-dimensional user-specific features and large\naction set, or both. In this paper, we introduce a novel non-stationary\ncontextual bandit algorithm that addresses these concerns. It combines a\nscalable, deep-neural-network-based architecture with a carefully designed\nexploration mechanism that strategically prioritizes collecting information\nwith the most lasting value in a non-stationary environment. Through empirical\nevaluations on two real-world recommendation datasets, which exhibit pronounced\nnon-stationarity, we demonstrate that our approach significantly outperforms\nthe state-of-the-art baselines.\n","authors":["Zheqing Zhu","Yueyang Liu","Xu Kuang","Benjamin Van Roy"],"pdf_url":"https://arxiv.org/pdf/2310.07786v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2305.05658v2","updated":"2023-10-11T17:59:44Z","published":"2023-05-09T17:52:59Z","title":"TidyBot: Personalized Robot Assistance with Large Language Models","summary":" For a robot to personalize physical assistance effectively, it must learn\nuser preferences that can be generally reapplied to future scenarios. In this\nwork, we investigate personalization of household cleanup with robots that can\ntidy up rooms by picking up objects and putting them away. A key challenge is\ndetermining the proper place to put each object, as people's preferences can\nvary greatly depending on personal taste or cultural background. For instance,\none person may prefer storing shirts in the drawer, while another may prefer\nthem on the shelf. We aim to build systems that can learn such preferences from\njust a handful of examples via prior interactions with a particular person. We\nshow that robots can combine language-based planning and perception with the\nfew-shot summarization capabilities of large language models (LLMs) to infer\ngeneralized user preferences that are broadly applicable to future\ninteractions. This approach enables fast adaptation and achieves 91.2% accuracy\non unseen objects in our benchmark dataset. 
We also demonstrate our approach on\na real-world mobile manipulator called TidyBot, which successfully puts away\n85.0% of objects in real-world test scenarios.\n","authors":["Jimmy Wu","Rika Antonova","Adam Kan","Marion Lepert","Andy Zeng","Shuran Song","Jeannette Bohg","Szymon Rusinkiewicz","Thomas Funkhouser"],"pdf_url":"https://arxiv.org/pdf/2305.05658v2.pdf","comment":"Accepted to Autonomous Robots (AuRo) - Special Issue: Large Language\n Models in Robotics, 2023 and IEEE/RSJ International Conference on Intelligent\n Robots and Systems (IROS), 2023. Project page:\n https://tidybot.cs.princeton.edu"},{"id":"http://arxiv.org/abs/2310.07713v1","updated":"2023-10-11T17:59:05Z","published":"2023-10-11T17:59:05Z","title":"InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining","summary":" Pretraining auto-regressive large language models (LLMs) with retrieval\ndemonstrates better perplexity and factual accuracy by leveraging external\ndatabases. However, the size of existing pretrained retrieval-augmented LLM is\nstill limited (e.g., Retro has 7.5B parameters), which limits the effectiveness\nof instruction tuning and zero-shot generalization. In this work, we introduce\nRetro 48B, the largest LLM pretrained with retrieval before instruction tuning.\nSpecifically, we continue to pretrain the 43B GPT model on additional 100\nbillion tokens using the Retro augmentation method by retrieving from 1.2\ntrillion tokens. The obtained foundation model, Retro 48B, largely outperforms\nthe original 43B GPT in terms of perplexity. After instruction tuning on Retro,\nInstructRetro demonstrates significant improvement over the instruction tuned\nGPT on zero-shot question answering (QA) tasks. Specifically, the average\nimprovement of InstructRetro is 7% over its GPT counterpart across 8 short-form\nQA tasks, and 10% over GPT across 4 challenging long-form QA tasks.\nSurprisingly, we find that one can ablate the encoder from InstructRetro\narchitecture and directly use its decoder backbone, while achieving comparable\nresults. We hypothesize that pretraining with retrieval makes its decoder good\nat incorporating context for QA. Our results highlights the promising direction\nto obtain a better GPT decoder for QA through continued pretraining with\nretrieval before instruction tuning.\n","authors":["Boxin Wang","Wei Ping","Lawrence McAfee","Peng Xu","Bo Li","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2310.07713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07712v1","updated":"2023-10-11T17:59:02Z","published":"2023-10-11T17:59:02Z","title":"Found in the Middle: Permutation Self-Consistency Improves Listwise\n Ranking in Large Language Models","summary":" Large language models (LLMs) exhibit positional bias in how they use context,\nwhich especially complicates listwise ranking. To address this, we propose\npermutation self-consistency, a form of self-consistency over ranking list\noutputs of black-box LLMs. Our key idea is to marginalize out different list\norders in the prompt to produce an order-independent ranking with less\npositional bias. First, given some input prompt, we repeatedly shuffle the list\nin the prompt and pass it through the LLM while holding the instructions the\nsame. Next, we aggregate the resulting sample of rankings by computing the\ncentral ranking closest in distance to all of them, marginalizing out prompt\norder biases in the process. 
Theoretically, we prove the robustness of our\nmethod, showing convergence to the true ranking in the presence of random\nperturbations. Empirically, on five list-ranking datasets in sorting and\npassage reranking, our approach improves scores from conventional inference by\nup to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B), surpassing the previous\nstate of the art in passage reranking. Our code is at\nhttps://github.com/castorini/perm-sc.\n","authors":["Raphael Tang","Xinyu Zhang","Xueguang Ma","Jimmy Lin","Ferhan Ture"],"pdf_url":"https://arxiv.org/pdf/2310.07712v1.pdf","comment":"First two authors contributed equally; 10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.07711v1","updated":"2023-10-11T17:58:25Z","published":"2023-10-11T17:58:25Z","title":"Growing Brains: Co-emergence of Anatomical and Functional Modularity in\n Recurrent Neural Networks","summary":" Recurrent neural networks (RNNs) trained on compositional tasks can exhibit\nfunctional modularity, in which neurons can be clustered by activity similarity\nand participation in shared computational subtasks. Unlike brains, these RNNs\ndo not exhibit anatomical modularity, in which functional clustering is\ncorrelated with strong recurrent coupling and spatial localization of\nfunctional clusters. Contrasting with functional modularity, which can be\nephemerally dependent on the input, anatomically modular networks form a robust\nsubstrate for solving the same subtasks in the future. To examine whether it is\npossible to grow brain-like anatomical modularity, we apply a recent machine\nlearning method, brain-inspired modular training (BIMT), to a network being\ntrained to solve a set of compositional cognitive tasks. We find that\nfunctional and anatomical clustering emerge together, such that functionally\nsimilar neurons also become spatially localized and interconnected. Moreover,\ncompared to standard $L_1$ or no regularization settings, the model exhibits\nsuperior performance by optimally balancing task performance and network\nsparsity. In addition to achieving brain-like organization in RNNs, our\nfindings also suggest that BIMT holds promise for applications in neuromorphic\ncomputing and enhancing the interpretability of neural network architectures.\n","authors":["Ziming Liu","Mikail Khona","Ila R. Fiete","Max Tegmark"],"pdf_url":"https://arxiv.org/pdf/2310.07711v1.pdf","comment":"8 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.07710v1","updated":"2023-10-11T17:57:35Z","published":"2023-10-11T17:57:35Z","title":"DiPmark: A Stealthy, Efficient and Resilient Watermark for Large\n Language Models","summary":" Watermarking techniques offer a promising way to secure data via embedding\ncovert information into the data. A paramount challenge in the domain lies in\npreserving the distribution of original data during watermarking. Our research\nextends and refines existing watermarking framework, placing emphasis on the\nimportance of a distribution-preserving (DiP) watermark. Contrary to the\ncurrent strategies, our proposed DiPmark preserves the original token\ndistribution during watermarking (stealthy), is detectable without access to\nthe language model API or weights (efficient), and is robust to moderate\nchanges of tokens (resilient). This is achieved by incorporating a novel\nreweight strategy, combined with a hash function that assigns unique\n\\textit{i.i.d.} ciphers based on the context. 
The empirical benchmarks of our\napproach underscore its stealthiness, efficiency, and resilience, making it a\nrobust solution for watermarking tasks that demand impeccable quality\npreservation.\n","authors":["Yihan Wu","Zhengmian Hu","Hongyang Zhang","Heng Huang"],"pdf_url":"https://arxiv.org/pdf/2310.07710v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07707v1","updated":"2023-10-11T17:57:14Z","published":"2023-10-11T17:57:14Z","title":"MatFormer: Nested Transformer for Elastic Inference","summary":" Transformer models are deployed in a wide range of settings, from\nmulti-accelerator clusters to standalone mobile phones. The diverse inference\nconstraints in these scenarios necessitate practitioners to train foundation\nmodels such as PaLM 2, Llama, & ViTs as a series of models of varying sizes.\nDue to significant training costs, only a select few model sizes are trained\nand supported, limiting more fine-grained control over relevant tradeoffs,\nincluding latency, cost, and accuracy. This work introduces MatFormer, a nested\nTransformer architecture designed to offer elasticity in a variety of\ndeployment constraints. Each Feed Forward Network (FFN) block of a MatFormer\nmodel is jointly optimized with a few nested smaller FFN blocks. This training\nprocedure allows for the Mix'n'Match of model granularities across layers --\ni.e., a trained universal MatFormer model enables extraction of hundreds of\naccurate smaller models, which were never explicitly optimized. We empirically\ndemonstrate MatFormer's effectiveness across different model classes (decoders\n& encoders), modalities (language & vision), and scales (up to 2.6B\nparameters). We find that a 2.6B decoder-only MatFormer language model (MatLM)\nallows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting\ncomparable validation loss and one-shot downstream evaluations to their\nindependently trained counterparts. Furthermore, we observe that smaller\nencoders extracted from a universal MatFormer-based ViT (MatViT) encoder\npreserve the metric-space structure for adaptive large-scale retrieval.\nFinally, we showcase that speculative decoding with the accurate and consistent\nsubmodels extracted from MatFormer can further reduce inference latency.\n","authors":[" Devvrit","Sneha Kudugunta","Aditya Kusupati","Tim Dettmers","Kaifeng Chen","Inderjit Dhillon","Yulia Tsvetkov","Hannaneh Hajishirzi","Sham Kakade","Ali Farhadi","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2310.07707v1.pdf","comment":"31 pages, 12 figures, first three authors contributed equally"},{"id":"http://arxiv.org/abs/2310.07699v1","updated":"2023-10-11T17:49:13Z","published":"2023-10-11T17:49:13Z","title":"From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched\n Captions","summary":" Web-crawled datasets are pivotal to the success of pre-training\nvision-language models, exemplified by CLIP. However, web-crawled AltTexts can\nbe noisy and potentially irrelevant to images, thereby undermining the crucial\nimage-text alignment. Existing methods for rewriting captions using large\nlanguage models (LLMs) have shown promise on small, curated datasets like CC3M\nand CC12M. Nevertheless, their efficacy on massive web-captured captions is\nconstrained by the inherent noise and randomness in such data. In this study,\nwe address this limitation by focusing on two key aspects: data quality and\ndata variety. 
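The nested feed-forward idea in the MatFormer abstract above lends itself to a short sketch. The NumPy code below is a toy illustration under assumed dimensions, not the released implementation: each smaller granularity reuses a prefix slice of the full FFN weights, which is what makes post-hoc Mix'n'Match extraction of submodels possible.

import numpy as np

d_model, d_ff = 16, 64
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)   # full FFN up-projection
W_out = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)     # full FFN down-projection

def nested_ffn(x: np.ndarray, width: int) -> np.ndarray:
    """Run the FFN using only the first `width` hidden units.

    Every smaller granularity is a prefix of the full block, so one set of
    trained weights serves many extraction sizes.
    """
    h = np.maximum(x @ W_in[:, :width], 0.0)   # ReLU over the nested slice
    return h @ W_out[:width, :]

x = rng.normal(size=(2, d_model))
for width in (16, 32, 64):   # hypothetical granularities
    print(width, nested_ffn(x, width).shape)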
Unlike recent LLM rewriting techniques, we emphasize exploiting\nvisual concepts and their integration into the captions to improve data\nquality. For data variety, we propose a novel mixed training scheme that\noptimally leverages AltTexts alongside newly generated Visual-enriched Captions\n(VeC). We use CLIP as one example and adapt the method for CLIP training on\nlarge-scale web-crawled datasets, named VeCLIP. We conduct a comprehensive\nevaluation of VeCLIP across small, medium, and large scales of raw data. Our\nresults show significant advantages in image-text alignment and overall model\nperformance, underscoring the effectiveness of VeCLIP in improving CLIP\ntraining. For example, VeCLIP achieves a remarkable over 20% improvement in\nCOCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency,\nwe also achieve a notable over 3% improvement while using only 14% of the data\nemployed in the vanilla CLIP and 11% in ALIGN.\n","authors":["Zhengfeng Lai","Haotian Zhang","Wentao Wu","Haoping Bai","Aleksei Timofeev","Xianzhi Du","Zhe Gan","Jiulong Shan","Chen-Nee Chuah","Yinfei Yang","Meng Cao"],"pdf_url":"https://arxiv.org/pdf/2310.07699v1.pdf","comment":"CV/ML"},{"id":"http://arxiv.org/abs/2310.07698v1","updated":"2023-10-11T17:46:59Z","published":"2023-10-11T17:46:59Z","title":"SurroCBM: Concept Bottleneck Surrogate Models for Generative Post-hoc\n Explanation","summary":" Explainable AI seeks to bring light to the decision-making processes of\nblack-box models. Traditional saliency-based methods, while highlighting\ninfluential data segments, often lack semantic understanding. Recent\nadvancements, such as Concept Activation Vectors (CAVs) and Concept Bottleneck\nModels (CBMs), offer concept-based explanations but necessitate human-defined\nconcepts. However, human-annotated concepts are expensive to attain. This paper\nintroduces the Concept Bottleneck Surrogate Models (SurroCBM), a novel\nframework that aims to explain the black-box models with automatically\ndiscovered concepts. SurroCBM identifies shared and unique concepts across\nvarious black-box models and employs an explainable surrogate model for\npost-hoc explanations. An effective training strategy using self-generated data\nis proposed to enhance explanation quality continuously. Through extensive\nexperiments, we demonstrate the efficacy of SurroCBM in concept discovery and\nexplanation, underscoring its potential in advancing the field of explainable\nAI.\n","authors":["Bo Pan","Zhenke Liu","Yifei Zhang","Liang Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.07698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.11196v2","updated":"2023-10-11T17:38:26Z","published":"2022-03-18T20:23:41Z","title":"Performance of Deep Learning models with transfer learning for\n multiple-step-ahead forecasts in monthly time series","summary":" Deep Learning and transfer learning models are being used to generate time\nseries forecasts; however, there is scarce evidence about their performance\nprediction that it is more evident for monthly time series. The purpose of this\npaper is to compare Deep Learning models with transfer learning and without\ntransfer learning and other traditional methods used for monthly forecasts to\nanswer three questions about the suitability of Deep Learning and Transfer\nLearning to generate predictions of time series. Time series of M4 and M3\ncompetitions were used for the experiments. 
The results suggest that deep\nlearning models based on TCN, LSTM, and CNN with transfer learning tend to\nsurpass the performance prediction of other traditional methods. On the other\nhand, TCN and LSTM, trained directly on the target time series, got similar or\nbetter performance than traditional methods for some forecast horizons.\n","authors":["Martín Solís","Luis-Alexander Calvo-Valverde"],"pdf_url":"https://arxiv.org/pdf/2203.11196v2.pdf","comment":"20 pages, 7 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.07683v1","updated":"2023-10-11T17:34:56Z","published":"2023-10-11T17:34:56Z","title":"Controllable Data Generation Via Iterative Data-Property Mutual Mappings","summary":" Deep generative models have been widely used for their ability to generate\nrealistic data samples in various areas, such as images, molecules, text, and\nspeech. One major goal of data generation is controllability, namely to\ngenerate new data with desired properties. Despite growing interest in the area\nof controllable generation, significant challenges still remain, including 1)\ndisentangling desired properties with unrelated latent variables, 2)\nout-of-distribution property control, and 3) objective optimization for\nout-of-distribution property control. To address these challenges, in this\npaper, we propose a general framework to enhance VAE-based data generators with\nproperty controllability and ensure disentanglement. Our proposed objective can\nbe optimized on both data seen and unseen in the training set. We propose a\ntraining procedure to train the objective in a semi-supervised manner by\niteratively conducting mutual mappings between the data and properties. The\nproposed framework is implemented on four VAE-based controllable generators to\nevaluate its performance on property error, disentanglement, generation\nquality, and training time. The results indicate that our proposed framework\nenables more precise control over the properties of generated samples in a\nshort training time, ensuring the disentanglement and keeping the validity of\nthe generated samples.\n","authors":["Bo Pan","Muran Qin","Shiyu Wang","Yifei Zhang","Liang Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.07683v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07678v1","updated":"2023-10-11T17:21:48Z","published":"2023-10-11T17:21:48Z","title":"Explainable Image Similarity: Integrating Siamese Networks and Grad-CAM","summary":" With the proliferation of image-based applications in various domains, the\nneed for accurate and interpretable image similarity measures has become\nincreasingly critical. Existing image similarity models often lack\ntransparency, making it challenging to understand the reasons why two images\nare considered similar. In this paper, we propose the concept of explainable\nimage similarity, where the goal is the development of an approach, which is\ncapable of providing similarity scores along with visual factual and\ncounterfactual explanations. Along this line, we present a new framework, which\nintegrates Siamese Networks and Grad-CAM for providing explainable image\nsimilarity and discuss the potential benefits and challenges of adopting this\napproach. In addition, we provide a comprehensive discussion about factual and\ncounterfactual explanations provided by the proposed framework for assisting\ndecision making. 
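To make the Siamese half of the explainable image similarity framework above concrete, here is a minimal PyTorch sketch: one shared encoder embeds both images and cosine similarity serves as the score. The tiny untrained encoder is purely hypothetical, and the Grad-CAM step that would produce the factual and counterfactual visual explanations is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Toy shared encoder standing in for the Siamese branch backbone."""
    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(16, embed_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def similarity_score(encoder: nn.Module, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the two embeddings (same weights for both branches)."""
    za, zb = encoder(img_a), encoder(img_b)
    return F.cosine_similarity(za, zb)

encoder = TinyEncoder()
a, b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
print(similarity_score(encoder, a, b).item())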
The proposed approach has the potential to enhance the\ninterpretability, trustworthiness and user acceptance of image-based systems in\nreal-world image similarity applications. The implementation code can be found\nin https://github.com/ioannislivieris/Grad_CAM_Siamese.git.\n","authors":["Ioannis E. Livieris","Emmanuel Pintelas","Niki Kiriakidou","Panagiotis Pintelas"],"pdf_url":"https://arxiv.org/pdf/2310.07678v1.pdf","comment":"The manuscript has been submitted for publication in \"Journal of\n Imaging\""},{"id":"http://arxiv.org/abs/2310.07676v1","updated":"2023-10-11T17:21:03Z","published":"2023-10-11T17:21:03Z","title":"Composite Backdoor Attacks Against Large Language Models","summary":" Large language models (LLMs) have demonstrated superior performance compared\nto previous methods on various tasks, and often serve as the foundation models\nfor many researches and services. However, the untrustworthy third-party LLMs\nmay covertly introduce vulnerabilities for downstream tasks. In this paper, we\nexplore the vulnerability of LLMs through the lens of backdoor attacks.\nDifferent from existing backdoor attacks against LLMs, ours scatters multiple\ntrigger keys in different prompt components. Such a Composite Backdoor Attack\n(CBA) is shown to be stealthier than implanting the same multiple trigger keys\nin only a single component. CBA ensures that the backdoor is activated only\nwhen all trigger keys appear. Our experiments demonstrate that CBA is effective\nin both natural language processing (NLP) and multimodal tasks. For instance,\nwith $3\\%$ poisoning samples against the LLaMA-7B model on the Emotion dataset,\nour attack achieves a $100\\%$ Attack Success Rate (ASR) with a False Triggered\nRate (FTR) below $2.06\\%$ and negligible model accuracy degradation. The unique\ncharacteristics of our CBA can be tailored for various practical scenarios,\ne.g., targeting specific user groups. Our work highlights the necessity of\nincreased security research on the trustworthiness of foundation LLMs.\n","authors":["Hai Huang","Zhengyu Zhao","Michael Backes","Yun Shen","Yang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07676v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07672v1","updated":"2023-10-11T17:18:51Z","published":"2023-10-11T17:18:51Z","title":"Stabilizing Estimates of Shapley Values with Control Variates","summary":" Shapley values are among the most popular tools for explaining predictions of\nblackbox machine learning models. However, their high computational cost\nmotivates the use of sampling approximations, inducing a considerable degree of\nuncertainty. To stabilize these model explanations, we propose ControlSHAP, an\napproach based on the Monte Carlo technique of control variates. Our\nmethodology is applicable to any machine learning model and requires virtually\nno extra computation or modeling effort. On several high-dimensional datasets,\nwe find it can produce dramatic reductions in the Monte Carlo variability of\nShapley estimates.\n","authors":["Jeremy Goldwasser","Giles Hooker"],"pdf_url":"https://arxiv.org/pdf/2310.07672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07668v1","updated":"2023-10-11T17:17:40Z","published":"2023-10-11T17:17:40Z","title":"GRaMuFeN: Graph-based Multi-modal Fake News Detection in Social Media","summary":" The proliferation of social media platforms such as Twitter, Instagram, and\nWeibo has significantly enhanced the dissemination of false information. 
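The ControlSHAP abstract above rests on the Monte Carlo technique of control variates; the snippet below demonstrates only that underlying trick on a toy expectation, since the abstract does not detail how the control variate for Shapley values is constructed. The target function, distribution, and sample sizes are invented.

import numpy as np

rng = np.random.default_rng(0)

def estimate_with_control_variate(y: np.ndarray, z: np.ndarray, z_mean: float) -> float:
    """Control-variate estimator of E[y] given samples z with known mean E[z] = z_mean."""
    c = np.cov(y, z)[0, 1] / np.var(z)          # (near-)optimal coefficient
    return float(np.mean(y - c * (z - z_mean)))

# Toy target: E[X**2] for X ~ N(1, 1); the control variate is X itself,
# whose mean is known exactly, so variability in the estimate shrinks.
n, trials = 200, 2000
plain, cv = [], []
for _ in range(trials):
    x = rng.normal(loc=1.0, scale=1.0, size=n)
    y = x**2
    plain.append(np.mean(y))
    cv.append(estimate_with_control_variate(y, x, z_mean=1.0))

print("plain MC variance       :", np.var(plain))
print("control-variate variance:", np.var(cv))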
This\nphenomenon grants both individuals and governmental entities the ability to\nshape public opinions, highlighting the need for deploying effective detection\nmethods. In this paper, we propose GraMuFeN, a model designed to detect fake\ncontent by analyzing both the textual and image content of news. GraMuFeN\ncomprises two primary components: a text encoder and an image encoder. For\ntextual analysis, GraMuFeN treats each text as a graph and employs a Graph\nConvolutional Neural Network (GCN) as the text encoder. Additionally, the\npre-trained ResNet-152, as a Convolutional Neural Network (CNN), has been\nutilized as the image encoder. By integrating the outputs from these two\nencoders and implementing a contrastive similarity loss function, GraMuFeN\nachieves remarkable results. Extensive evaluations conducted on two publicly\navailable benchmark datasets for social media news indicate a 10 % increase in\nmicro F1-Score, signifying improvement over existing state-of-the-art models.\nThese findings underscore the effectiveness of combining GCN and CNN models for\ndetecting fake news in multi-modal data, all while minimizing the additional\ncomputational burden imposed by model parameters.\n","authors":["Makan Kananian","Fatima Badiei","S. AmirAli Gh. Ghahramani"],"pdf_url":"https://arxiv.org/pdf/2310.07668v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07667v1","updated":"2023-10-11T17:16:33Z","published":"2023-10-11T17:16:33Z","title":"Global Minima, Recoverability Thresholds, and Higher-Order Structure in\n GNNS","summary":" We analyze the performance of graph neural network (GNN) architectures from\nthe perspective of random graph theory. Our approach promises to complement\nexisting lenses on GNN analysis, such as combinatorial expressive power and\nworst-case adversarial analysis, by connecting the performance of GNNs to\ntypical-case properties of the training data. First, we theoretically\ncharacterize the nodewise accuracy of one- and two-layer GCNs relative to the\ncontextual stochastic block model (cSBM) and related models. We additionally\nprove that GCNs cannot beat linear models under certain circumstances. Second,\nwe numerically map the recoverability thresholds, in terms of accuracy, of four\ndiverse GNN architectures (GCN, GAT, SAGE, and Graph Transformer) under a\nvariety of assumptions about the data. Sample results of this second analysis\ninclude: heavy-tailed degree distributions enhance GNN performance, GNNs can\nwork well on strongly heterophilous graphs, and SAGE and Graph Transformer can\nperform well on arbitrarily noisy edge data, but no architecture handled\nsufficiently noisy feature data well. Finally, we show how both specific\nhigher-order structures in synthetic data and the mix of empirical structures\nin real data have dramatic effects (usually negative) on GNN performance.\n","authors":["Drake Brown","Trevor Garrity","Kaden Parker","Jason Oliphant","Stone Carson","Cole Hanson","Zachary Boyd"],"pdf_url":"https://arxiv.org/pdf/2310.07667v1.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/1910.08883v4","updated":"2023-10-11T17:14:41Z","published":"2019-10-20T03:14:20Z","title":"High-dimensional and universally consistent k-sample tests","summary":" The k-sample testing problem involves determining whether $k$ groups of data\npoints are each drawn from the same distribution. 
The standard method for\nk-sample testing in biomedicine is Multivariate analysis of variance (MANOVA),\ndespite that it depends on strong, and often unsuitable, parametric\nassumptions. Moreover, independence testing and k-sample testing are closely\nrelated, and several universally consistent high-dimensional independence tests\nsuch as distance correlation (Dcorr) and Hilbert-Schmidt-Independence-Criterion\n(Hsic) enjoy solid theoretical and empirical properties. In this paper, we\nprove that independence tests achieve universally consistent k-sample testing\nand that k-sample statistics such as Energy and Maximum Mean Discrepancy (MMD)\nare precisely equivalent to Dcorr. An empirical evaluation of nonparametric\nindependence tests showed that they generally perform better than the popular\nMANOVA test, even in Gaussian distributed scenarios. The evaluation included\nseveral popular independence statistics and covered a comprehensive set of\nsimulations. Additionally, the testing approach was extended to perform\nmultiway and multilevel tests, which were demonstrated in a simulated study as\nwell as real-world fMRI brain scans with a set of attributes.\n","authors":["Sambit Panda","Cencheng Shen","Ronan Perry","Jelle Zorn","Antoine Lutz","Carey E. Priebe","Joshua T. Vogelstein"],"pdf_url":"https://arxiv.org/pdf/1910.08883v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07665v1","updated":"2023-10-11T17:11:10Z","published":"2023-10-11T17:11:10Z","title":"Deep Backtracking Counterfactuals for Causally Compliant Explanations","summary":" Counterfactuals can offer valuable insights by answering what would have been\nobserved under altered circumstances, conditional on a factual observation.\nWhereas the classical interventional interpretation of counterfactuals has been\nstudied extensively, backtracking constitutes a less studied alternative; the\nbacktracking principle has emerged as an alternative philosophy where all\ncausal laws are kept intact. In the present work, we introduce a practical\nmethod for computing backtracking counterfactuals in structural causal models\nthat consist of deep generative components. To this end, we impose conditions\non the structural assignments that enable the generation of counterfactuals by\nsolving a tractable constrained optimization problem in the structured latent\nspace of a causal model. Our formulation also facilitates a comparison with\nmethods in the field of counterfactual explanations. Compared to these, our\nmethod represents a versatile, modular and causally compliant alternative. We\ndemonstrate these properties experimentally on a modified version of MNIST and\nCelebA.\n","authors":["Klaus-Rudolf Kladny","Julius von Kügelgen","Bernhard Schölkopf","Michael Muehlebach"],"pdf_url":"https://arxiv.org/pdf/2310.07665v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.06577v2","updated":"2023-10-11T17:10:13Z","published":"2023-01-16T19:27:16Z","title":"Learning from Very Little Data: On the Value of Landscape Analysis for\n Predicting Software Project Health","summary":" When data is scarce, software analytics can make many mistakes. For example,\nconsider learning predictors for open source project health (e.g. the number of\nclosed pull requests in twelve months time). The training data for this task\nmay be very small (e.g. five years of data, collected every month means just 60\nrows of training data). 
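Since the k-sample testing abstract above ties statistics such as Energy and MMD to independence tests, a compact permutation test built on the two-sample energy distance may help fix ideas. This is a generic textbook construction with invented toy data, not the paper's multiway or multilevel procedures.

import numpy as np

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Two-sample energy distance: 2 E||X-Y|| - E||X-X'|| - E||Y-Y'||."""
    def mean_pairwise(a, b):
        return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

def permutation_test(x, y, n_perm: int = 500, seed: int = 0) -> float:
    """P-value for H0: both groups are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.vstack([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        px, py = pooled[perm[:len(x)]], pooled[perm[len(x):]]
        count += energy_distance(px, py) >= observed
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(60, 3))
y = rng.normal(0.7, 1.0, size=(60, 3))   # shifted mean, so H0 should be rejected
print("p-value:", permutation_test(x, y))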
The models generated from such tiny data sets can make\nmany prediction errors.\n Those errors can be tamed by a {\\em landscape analysis} that selects better\nlearner control parameters. Our niSNEAK tool (a)~clusters the data to find the\ngeneral landscape of the hyperparameters; then (b)~explores a few\nrepresentatives from each part of that landscape. niSNEAK is both faster and\nmore effective than prior state-of-the-art hyperparameter optimization\nalgorithms (e.g. FLASH, HYPEROPT, OPTUNA).\n The configurations found by niSNEAK have far less error than other methods.\nFor example, for project health indicators such as $C$= number of commits;\n$I$=number of closed issues, and $R$=number of closed pull requests, niSNEAK's\n12 month prediction errors are \\{I=0\\%, R=33\\%\\,C=47\\%\\}\n Based on the above, we recommend landscape analytics (e.g. niSNEAK)\nespecially when learning from very small data sets. This paper only explores\nthe application of niSNEAK to project health. That said, we see nothing in\nprinciple that prevents the application of this technique to a wider range of\nproblems.\n To assist other researchers in repeating, improving, or even refuting our\nresults, all our scripts and data are available on GitHub at\nhttps://github.com/zxcv123456qwe/niSneak\n","authors":["Andre Lustosa","Tim Menzies"],"pdf_url":"https://arxiv.org/pdf/2301.06577v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.08703v3","updated":"2023-10-11T17:00:34Z","published":"2023-05-15T15:06:20Z","title":"Schema-adaptable Knowledge Graph Construction","summary":" Conventional Knowledge Graph Construction (KGC) approaches typically follow\nthe static information extraction paradigm with a closed set of pre-defined\nschema. As a result, such approaches fall short when applied to dynamic\nscenarios or domains, whereas a new type of knowledge emerges. This\nnecessitates a system that can handle evolving schema automatically to extract\ninformation for KGC. To address this need, we propose a new task called\nschema-adaptable KGC, which aims to continually extract entity, relation, and\nevent based on a dynamically changing schema graph without re-training. We\nfirst split and convert existing datasets based on three principles to build a\nbenchmark, i.e., horizontal schema expansion, vertical schema expansion, and\nhybrid schema expansion; then investigate the schema-adaptable performance of\nseveral well-known approaches such as Text2Event, TANL, UIE and GPT-3.5. We\nfurther propose a simple yet effective baseline dubbed \\textsc{AdaKGC}, which\ncontains schema-enriched prefix instructor and schema-conditioned dynamic\ndecoding to better handle evolving schema. Comprehensive experimental results\nillustrate that AdaKGC can outperform baselines but still have room for\nimprovement. We hope the proposed work can deliver benefits to the community.\nCode and datasets available at https://github.com/zjunlp/AdaKGC.\n","authors":["Hongbin Ye","Honghao Gui","Xin Xu","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.08703v3.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.07658v1","updated":"2023-10-11T17:00:03Z","published":"2023-10-11T17:00:03Z","title":"The First Pathloss Radio Map Prediction Challenge","summary":" To foster research and facilitate fair comparisons among recently proposed\npathloss radio map prediction methods, we have launched the ICASSP 2023 First\nPathloss Radio Map Prediction Challenge. 
In this short overview paper, we\nbriefly describe the pathloss prediction problem, the provided datasets, the\nchallenge task and the challenge evaluation methodology. Finally, we present\nthe results of the challenge.\n","authors":["Çağkan Yapar","Fabian Jaensch","Ron Levie","Gitta Kutyniok","Giuseppe Caire"],"pdf_url":"https://arxiv.org/pdf/2310.07658v1.pdf","comment":"ICASSP 2023"},{"id":"http://arxiv.org/abs/2310.07654v1","updated":"2023-10-11T16:54:57Z","published":"2023-10-11T16:54:57Z","title":"Audio-Visual Neural Syntax Acquisition","summary":" We study phrase structure induction from visually-grounded speech. The core\nidea is to first segment the speech waveform into sequences of word segments,\nand subsequently induce phrase structure using the inferred segment-level\ncontinuous representations. We present the Audio-Visual Neural Syntax Learner\n(AV-NSL) that learns phrase structure by listening to audio and looking at\nimages, without ever being exposed to text. By training on paired images and\nspoken captions, AV-NSL exhibits the capability to infer meaningful phrase\nstructures that are comparable to those derived by naturally-supervised text\nparsers, for both English and German. Our findings extend prior work in\nunsupervised language acquisition from speech and grounded grammar induction,\nand present one approach to bridge the gap between the two topics.\n","authors":["Cheng-I Jeff Lai","Freda Shi","Puyuan Peng","Yoon Kim","Kevin Gimpel","Shiyu Chang","Yung-Sung Chuang","Saurabhchand Bhati","David Cox","David Harwath","Yang Zhang","Karen Livescu","James Glass"],"pdf_url":"https://arxiv.org/pdf/2310.07654v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13172v2","updated":"2023-10-11T16:51:50Z","published":"2023-05-22T16:00:00Z","title":"Editing Large Language Models: Problems, Methods, and Opportunities","summary":" Despite the ability to train capable LLMs, the methodology for maintaining\ntheir relevancy and rectifying errors remains elusive. To this end, the past\nfew years have witnessed a surge in techniques for editing LLMs, the objective\nof which is to efficiently alter the behavior of LLMs within a specific domain\nwithout negatively impacting performance across other inputs. This paper\nembarks on a deep exploration of the problems, methods, and opportunities\nrelated to model editing for LLMs. In particular, we provide an exhaustive\noverview of the task definition and challenges associated with model editing,\nalong with an in-depth empirical analysis of the most progressive methods\ncurrently at our disposal. We also build a new benchmark dataset to facilitate\na more robust evaluation and pinpoint enduring issues intrinsic to existing\ntechniques. Our objective is to provide valuable insights into the\neffectiveness and feasibility of each editing technique, thereby assisting the\ncommunity in making informed decisions on the selection of the most appropriate\nmethod for a specific task or context. Code and datasets are available at\nhttps://github.com/zjunlp/EasyEdit.\n","authors":["Yunzhi Yao","Peng Wang","Bozhong Tian","Siyuan Cheng","Zhoubo Li","Shumin Deng","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.13172v2.pdf","comment":"EMNLP 2023. 
Updated with new experiments"},{"id":"http://arxiv.org/abs/2310.07648v1","updated":"2023-10-11T16:45:44Z","published":"2023-10-11T16:45:44Z","title":"Hypercomplex Multimodal Emotion Recognition from EEG and Peripheral\n Physiological Signals","summary":" Multimodal emotion recognition from physiological signals is receiving an\nincreasing amount of attention due to the impossibility to control them at will\nunlike behavioral reactions, thus providing more reliable information. Existing\ndeep learning-based methods still rely on extracted handcrafted features, not\ntaking full advantage of the learning ability of neural networks, and often\nadopt a single-modality approach, while human emotions are inherently expressed\nin a multimodal way. In this paper, we propose a hypercomplex multimodal\nnetwork equipped with a novel fusion module comprising parameterized\nhypercomplex multiplications. Indeed, by operating in a hypercomplex domain the\noperations follow algebraic rules which allow to model latent relations among\nlearned feature dimensions for a more effective fusion step. We perform\nclassification of valence and arousal from electroencephalogram (EEG) and\nperipheral physiological signals, employing the publicly available database\nMAHNOB-HCI surpassing a multimodal state-of-the-art network. The code of our\nwork is freely available at https://github.com/ispamm/MHyEEG.\n","authors":["Eleonora Lopez","Eleonora Chiarantano","Eleonora Grassucci","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2310.07648v1.pdf","comment":"Published at IEEE ICASSP workshops 2023"},{"id":"http://arxiv.org/abs/2310.07644v1","updated":"2023-10-11T16:40:57Z","published":"2023-10-11T16:40:57Z","title":"Rethinking the BERT-like Pretraining for DNA Sequences","summary":" With the success of large-scale pretraining in NLP, there is an increasing\ntrend of applying it to the domain of life sciences. In particular, pretraining\nmethods based on DNA sequences have garnered growing attention due to their\npotential to capture generic information about genes. However, existing\npretraining methods for DNA sequences largely rely on direct adoptions of BERT\npretraining from NLP, lacking a comprehensive understanding and a specifically\ntailored approach. To address this research gap, we first conducted a series of\nexploratory experiments and gained several insightful observations: 1) In the\nfine-tuning phase of downstream tasks, when using K-mer overlapping\ntokenization instead of K-mer non-overlapping tokenization, both overlapping\nand non-overlapping pretraining weights show consistent performance\nimprovement.2) During the pre-training process, using K-mer overlapping\ntokenization quickly produces clear K-mer embeddings and reduces the loss to a\nvery low level, while using K-mer non-overlapping tokenization results in less\ndistinct embeddings and continuously decreases the loss. 3) Using overlapping\ntokenization causes the self-attention in the intermediate layers of\npre-trained models to tend to overly focus on certain tokens, reflecting that\nthese layers are not adequately optimized. In summary, overlapping tokenization\ncan benefit the fine-tuning of downstream tasks but leads to inadequate\npretraining with fast convergence. To unleash the pretraining potential, we\nintroduce a novel approach called RandomMask, which gradually increases the\ntask difficulty of BERT-like pretraining by continuously expanding its mask\nboundary, forcing the model to learn more knowledge. 
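The overlapping versus non-overlapping K-mer tokenization contrasted in the DNA pretraining abstract above is easy to picture with a few lines of Python; the short sequence and k = 3 are arbitrary, and the RandomMask masking schedule itself is not reproduced here.

def kmer_tokenize(seq: str, k: int = 3, overlapping: bool = True):
    """Split a DNA sequence into k-mers, with stride 1 (overlapping) or stride k."""
    stride = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ATGCGTAC"
print(kmer_tokenize(seq, k=3, overlapping=True))    # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(kmer_tokenize(seq, k=3, overlapping=False))   # ['ATG', 'CGT']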
RandomMask is simple but\neffective, achieving top-tier performance across 26 datasets of 28 datasets\nspanning 7 downstream tasks.\n","authors":["Chaoqi Liang","Weiqiang Bai","Lifeng Qiao","Yuchen Ren","Jianle Sun","Peng Ye","Hongliang Yan","Xinzhu Ma","Wangmeng Zuo","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.07644v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07641v1","updated":"2023-10-11T16:38:11Z","published":"2023-10-11T16:38:11Z","title":"Evaluating Large Language Models at Evaluating Instruction Following","summary":" As research in large language models (LLMs) continues to accelerate,\nLLM-based evaluation has emerged as a scalable and cost-effective alternative\nto human evaluations for comparing the ever increasing list of models. This\npaper investigates the efficacy of these \"LLM evaluators\", particularly in\nusing them to assess instruction following, a metric that gauges how closely\ngenerated text adheres to the given instruction. We introduce a challenging\nmeta-evaluation benchmark, LLMBar, designed to test the ability of an LLM\nevaluator in discerning instruction-following outputs. The authors manually\ncurated 419 pairs of outputs, one adhering to instructions while the other\ndiverging, yet may possess deceptive qualities that mislead an LLM evaluator,\ne.g., a more engaging tone. Contrary to existing meta-evaluation, we discover\nthat different evaluators (i.e., combinations of LLMs and prompts) exhibit\ndistinct performance on LLMBar and even the highest-scoring ones have\nsubstantial room for improvement. We also present a novel suite of prompting\nstrategies that further close the gap between LLM and human evaluators. With\nLLMBar, we hope to offer more insight into LLM evaluators and foster future\nresearch in developing better instruction-following models.\n","authors":["Zhiyuan Zeng","Jiatong Yu","Tianyu Gao","Yu Meng","Tanya Goyal","Danqi Chen"],"pdf_url":"https://arxiv.org/pdf/2310.07641v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2207.10062v3","updated":"2023-10-11T16:32:49Z","published":"2022-07-20T17:47:54Z","title":"DataPerf: Benchmarks for Data-Centric AI Development","summary":" Machine learning research has long focused on models rather than datasets,\nand prominent datasets are used for common ML tasks without regard to the\nbreadth, difficulty, and faithfulness of the underlying problems. Neglecting\nthe fundamental importance of data has given rise to inaccuracy, bias, and\nfragility in real-world applications, and research is hindered by saturation\nacross existing dataset benchmarks. In response, we present DataPerf, a\ncommunity-led benchmark suite for evaluating ML datasets and data-centric\nalgorithms. We aim to foster innovation in data-centric AI through competition,\ncomparability, and reproducibility. We enable the ML community to iterate on\ndatasets, instead of just architectures, and we provide an open, online\nplatform with multiple rounds of challenges to support this iterative\ndevelopment. The first iteration of DataPerf contains five benchmarks covering\na wide spectrum of data-centric techniques, tasks, and modalities in vision,\nspeech, acquisition, debugging, and diffusion prompting, and we support hosting\nnew contributed benchmarks from the community. 
The benchmarks, online\nevaluation platform, and baseline implementations are open source, and the\nMLCommons Association will maintain DataPerf to ensure long-term benefits to\nacademia and industry.\n","authors":["Mark Mazumder","Colby Banbury","Xiaozhe Yao","Bojan Karlaš","William Gaviria Rojas","Sudnya Diamos","Greg Diamos","Lynn He","Alicia Parrish","Hannah Rose Kirk","Jessica Quaye","Charvi Rastogi","Douwe Kiela","David Jurado","David Kanter","Rafael Mosquera","Juan Ciro","Lora Aroyo","Bilge Acun","Lingjiao Chen","Mehul Smriti Raje","Max Bartolo","Sabri Eyuboglu","Amirata Ghorbani","Emmett Goodman","Oana Inel","Tariq Kane","Christine R. Kirkpatrick","Tzu-Sheng Kuo","Jonas Mueller","Tristan Thrush","Joaquin Vanschoren","Margaret Warren","Adina Williams","Serena Yeung","Newsha Ardalani","Praveen Paritosh","Lilith Bath-Leah","Ce Zhang","James Zou","Carole-Jean Wu","Cody Coleman","Andrew Ng","Peter Mattson","Vijay Janapa Reddi"],"pdf_url":"https://arxiv.org/pdf/2207.10062v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07632v1","updated":"2023-10-11T16:25:45Z","published":"2023-10-11T16:25:45Z","title":"Prompt Backdoors in Visual Prompt Learning","summary":" Fine-tuning large pre-trained computer vision models is infeasible for\nresource-limited users. Visual prompt learning (VPL) has thus emerged to\nprovide an efficient and flexible alternative to model fine-tuning through\nVisual Prompt as a Service (VPPTaaS). Specifically, the VPPTaaS provider\noptimizes a visual prompt given downstream data, and downstream users can use\nthis prompt together with the large pre-trained model for prediction. However,\nthis new learning paradigm may also pose security risks when the VPPTaaS\nprovider instead provides a malicious visual prompt. In this paper, we take the\nfirst step to explore such risks through the lens of backdoor attacks.\nSpecifically, we propose BadVisualPrompt, a simple yet effective backdoor\nattack against VPL. For example, poisoning $5\\%$ CIFAR10 training data leads to\nabove $99\\%$ attack success rates with only negligible model accuracy drop by\n$1.5\\%$. In particular, we identify and then address a new technical challenge\nrelated to interactions between the backdoor trigger and visual prompt, which\ndoes not exist in conventional, model-level backdoors. Moreover, we provide\nin-depth analyses of seven backdoor defenses from model, prompt, and input\nlevels. Overall, all these defenses are either ineffective or impractical to\nmitigate our BadVisualPrompt, implying the critical vulnerability of VPL.\n","authors":["Hai Huang","Zhengyu Zhao","Michael Backes","Yun Shen","Yang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07632v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07631v1","updated":"2023-10-11T16:24:06Z","published":"2023-10-11T16:24:06Z","title":"Graph Transformer Network for Flood Forecasting with Heterogeneous\n Covariates","summary":" Floods can be very destructive causing heavy damage to life, property, and\nlivelihoods. Global climate change and the consequent sea-level rise have\nincreased the occurrence of extreme weather events, resulting in elevated and\nfrequent flood risk. Therefore, accurate and timely flood forecasting in\ncoastal river systems is critical to facilitate good flood management. However,\nthe computational tools currently used are either slow or inaccurate. In this\npaper, we propose a Flood prediction tool using Graph Transformer Network\n(FloodGTN) for river systems. 
More specifically, FloodGTN learns the\nspatio-temporal dependencies of water levels at different monitoring stations\nusing Graph Neural Networks (GNNs) and an LSTM. It is currently implemented to\nconsider external covariates such as rainfall, tide, and the settings of\nhydraulic structures (e.g., outflows of dams, gates, pumps, etc.) along the\nriver. We use a Transformer to learn the attention given to external covariates\nin computing water levels. We apply the FloodGTN tool to data from the South\nFlorida Water Management District, which manages a coastal area prone to\nfrequent storms and hurricanes. Experimental results show that FloodGTN\noutperforms the physics-based model (HEC-RAS) by achieving higher accuracy with\n70% improvement while speeding up run times by at least 500x.\n","authors":["Jimeng Shi","Vitalii Stebliankin","Zhaonan Wang","Shaowen Wang","Giri Narasimhan"],"pdf_url":"https://arxiv.org/pdf/2310.07631v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07630v1","updated":"2023-10-11T16:23:07Z","published":"2023-10-11T16:23:07Z","title":"Differentiable Euler Characteristic Transforms for Shape Classification","summary":" The Euler Characteristic Transform (ECT) has proven to be a powerful\nrepresentation, combining geometrical and topological characteristics of shapes\nand graphs. However, the ECT was hitherto unable to learn task-specific\nrepresentations. We overcome this issue and develop a novel computational layer\nthat enables learning the ECT in an end-to-end fashion. Our method DECT is fast\nand computationally efficient, while exhibiting performance on a par with more\ncomplex models in both graph and point cloud classification tasks. Moreover, we\nshow that this seemingly unexpressive statistic still provides the same\ntopological expressivity as more complex topological deep learning layers\nprovide.\n","authors":["Ernst Roell","Bastian Rieck"],"pdf_url":"https://arxiv.org/pdf/2310.07630v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.08268v3","updated":"2023-10-11T16:17:20Z","published":"2023-03-14T23:01:27Z","title":"Chat with the Environment: Interactive Multimodal Perception Using Large\n Language Models","summary":" Programming robot behavior in a complex world faces challenges on multiple\nlevels, from dextrous low-level skills to high-level planning and reasoning.\nRecent pre-trained Large Language Models (LLMs) have shown remarkable reasoning\nability in few-shot robotic planning. However, it remains challenging to ground\nLLMs in multimodal sensory input and continuous action output, while enabling a\nrobot to interact with its environment and acquire novel information as its\npolicies unfold. We develop a robot interaction scenario with a partially\nobservable state, which necessitates a robot to decide on a range of epistemic\nactions in order to sample sensory information among multiple modalities,\nbefore being able to execute the task correctly. Matcha (Multimodal environment\nchatting) agent, an interactive perception framework, is therefore proposed\nwith an LLM as its backbone, whose ability is exploited to instruct epistemic\nactions and to reason over the resulting multimodal sensations (vision, sound,\nhaptics, proprioception), as well as to plan an entire task execution based on\nthe interactively acquired information. 
Our study demonstrates that LLMs can\nprovide high-level planning and reasoning skills and control interactive robot\nbehavior in a multimodal environment, while multimodal modules with the context\nof the environmental state help ground the LLMs and extend their processing\nability. The project website can be found at https://matcha-agent.github.io.\n","authors":["Xufeng Zhao","Mengdi Li","Cornelius Weber","Muhammad Burhan Hafez","Stefan Wermter"],"pdf_url":"https://arxiv.org/pdf/2303.08268v3.pdf","comment":"IROS2023, Detroit. See the project website at\n https://matcha-agent.github.io"},{"id":"http://arxiv.org/abs/2305.11141v3","updated":"2023-10-11T16:16:18Z","published":"2023-05-18T17:35:35Z","title":"Clifford Group Equivariant Neural Networks","summary":" We introduce Clifford Group Equivariant Neural Networks: a novel approach for\nconstructing $\\mathrm{O}(n)$- and $\\mathrm{E}(n)$-equivariant models. We\nidentify and study the $\\textit{Clifford group}$, a subgroup inside the\nClifford algebra whose definition we adjust to achieve several favorable\nproperties. Primarily, the group's action forms an orthogonal automorphism that\nextends beyond the typical vector space to the entire Clifford algebra while\nrespecting the multivector grading. This leads to several non-equivalent\nsubrepresentations corresponding to the multivector decomposition. Furthermore,\nwe prove that the action respects not just the vector space structure of the\nClifford algebra but also its multiplicative structure, i.e., the geometric\nproduct. These findings imply that every polynomial in multivectors constitutes\nan equivariant map with respect to the Clifford group. An advantage worth\nmentioning is that we obtain expressive layers that can elegantly generalize to\ninner-product spaces of any dimension. We demonstrate,\nnotably from a single core implementation, state-of-the-art performance on\nseveral distinct tasks, including a three-dimensional $n$-body experiment, a\nfour-dimensional Lorentz-equivariant high-energy physics experiment, and a\nfive-dimensional convex hull experiment.\n","authors":["David Ruhe","Johannes Brandstetter","Patrick Forré"],"pdf_url":"https://arxiv.org/pdf/2305.11141v3.pdf","comment":"Published at NeurIPS 2023 (Oral)"},{"id":"http://arxiv.org/abs/2310.07626v1","updated":"2023-10-11T16:09:09Z","published":"2023-10-11T16:09:09Z","title":"Unsupervised Learning of Sea Surface Height Interpolation from\n Multi-variate Simulated Satellite Observations","summary":" Satellite-based remote sensing missions have revolutionized our understanding\nof the Ocean state and dynamics. Among them, spaceborne altimetry provides\nvaluable measurements of Sea Surface Height (SSH), which is used to estimate\nsurface geostrophic currents. However, due to the sensor technology employed,\nimportant gaps occur in SSH observations. Complete SSH maps are produced by the\naltimetry community using linear Optimal Interpolations (OI) such as the\nwidely-used Data Unification and Altimeter Combination System (DUACS). However,\nOI is known for producing overly smooth fields and thus misses some\nmesostructures and eddies. On the other hand, Sea Surface Temperature (SST)\nproducts have much higher data coverage and SST is physically linked to\ngeostrophic currents through advection. We design a realistic twin experiment\nto emulate the satellite observations of SSH and SST to evaluate interpolation\nmethods. 
We introduce a deep learning network able to use SST information, and\na trainable in two settings: one where we have no access to ground truth during\ntraining and one where it is accessible. Our investigation involves a\ncomparative analysis of the aforementioned network when trained using either\nsupervised or unsupervised loss functions. We assess the quality of SSH\nreconstructions and further evaluate the network's performance in terms of eddy\ndetection and physical properties. We find that it is possible, even in an\nunsupervised setting to use SST to improve reconstruction performance compared\nto SST-agnostic interpolations. We compare our reconstructions to DUACS's and\nreport a decrease of 41\\% in terms of root mean squared error.\n","authors":["Theo Archambault","Arthur Filoche","Anastase Charantonis","Dominique Bereziat","Sylvie Thiria"],"pdf_url":"https://arxiv.org/pdf/2310.07626v1.pdf","comment":"submitted to JAMES. 26 pages"},{"id":"http://arxiv.org/abs/2305.12534v2","updated":"2023-10-11T16:05:21Z","published":"2023-05-21T18:26:31Z","title":"BertRLFuzzer: A BERT and Reinforcement Learning based Fuzzer","summary":" We present a novel tool BertRLFuzzer, a BERT and Reinforcement Learning (RL)\nbased fuzzer aimed at finding security vulnerabilities for Web applications.\nBertRLFuzzer works as follows: given a set of seed inputs, the fuzzer performs\ngrammar-adhering and attack-provoking mutation operations on them to generate\ncandidate attack vectors. The key insight of BertRLFuzzer is the use of RL with\na BERT model as an agent to guide the fuzzer to efficiently learn\ngrammar-adhering and attack-provoking mutation operators. In order to establish\nthe efficacy of BertRLFuzzer we compare it against a total of 13 black box and\nwhite box fuzzers over a benchmark of 9 victim websites with over 16K LOC. We\nobserved a significant improvement, relative to the nearest competing tool, in\nterms of time to first attack (54% less), new vulnerabilities found (17 new\nvulnerabilities), and attack rate (4.4% more attack vectors generated).\n","authors":["Piyush Jha","Joseph Scott","Jaya Sriram Ganeshna","Mudit Singh","Vijay Ganesh"],"pdf_url":"https://arxiv.org/pdf/2305.12534v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07612v1","updated":"2023-10-11T15:56:55Z","published":"2023-10-11T15:56:55Z","title":"PHYDI: Initializing Parameterized Hypercomplex Neural Networks as\n Identity Functions","summary":" Neural models based on hypercomplex algebra systems are growing and\nprolificating for a plethora of applications, ranging from computer vision to\nnatural language processing. Hand in hand with their adoption, parameterized\nhypercomplex neural networks (PHNNs) are growing in size and no techniques have\nbeen adopted so far to control their convergence at a large scale. In this\npaper, we study PHNNs convergence and propose parameterized hypercomplex\nidentity initialization (PHYDI), a method to improve their convergence at\ndifferent scales, leading to more robust performance when the number of layers\nscales up, while also reaching the same performance with fewer iterations. We\nshow the effectiveness of this approach in different benchmarks and with common\nPHNNs with ResNets- and Transformer-based architecture. 
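The PHYDI abstract above proposes initializing parameterized hypercomplex networks as identity functions. The abstract does not give the hypercomplex construction, so the PyTorch sketch below shows only the familiar real-valued analogue of that idea: zero-initializing the last layer of a residual branch so each block starts out as the identity map.

import torch
import torch.nn as nn

class IdentityInitBlock(nn.Module):
    """Residual block whose branch output is exactly zero at initialization."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        nn.init.zeros_(self.fc2.weight)   # branch contributes nothing at init...
        nn.init.zeros_(self.fc2.bias)     # ...so the block computes y = x exactly

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))

block = IdentityInitBlock(8)
x = torch.randn(4, 8)
print(torch.allclose(block(x), x))   # True at initialization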
The code is available\nat https://github.com/ispamm/PHYDI.\n","authors":["Matteo Mancanelli","Eleonora Grassucci","Aurelio Uncini","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2310.07612v1.pdf","comment":"Accepted at IEEE MLSP 2023 (Honorable Mention TOP 5% Outstanding\n Papers)"},{"id":"http://arxiv.org/abs/2203.06768v4","updated":"2023-10-11T15:46:16Z","published":"2022-03-13T21:39:24Z","title":"Probabilistically Robust Recourse: Navigating the Trade-offs between\n Costs and Robustness in Algorithmic Recourse","summary":" As machine learning models are increasingly being employed to make\nconsequential decisions in real-world settings, it becomes critical to ensure\nthat individuals who are adversely impacted (e.g., loan denied) by the\npredictions of these models are provided with a means for recourse. While\nseveral approaches have been proposed to construct recourses for affected\nindividuals, the recourses output by these methods either achieve low costs\n(i.e., ease-of-implementation) or robustness to small perturbations (i.e.,\nnoisy implementations of recourses), but not both due to the inherent\ntrade-offs between the recourse costs and robustness. Furthermore, prior\napproaches do not provide end users with any agency over navigating the\naforementioned trade-offs. In this work, we address the above challenges by\nproposing the first algorithmic framework which enables users to effectively\nmanage the recourse cost vs. robustness trade-offs. More specifically, our\nframework Probabilistically ROBust rEcourse (\\texttt{PROBE}) lets users choose\nthe probability with which a recourse could get invalidated (recourse\ninvalidation rate) if small changes are made to the recourse i.e., the recourse\nis implemented somewhat noisily. To this end, we propose a novel objective\nfunction which simultaneously minimizes the gap between the achieved\n(resulting) and desired recourse invalidation rates, minimizes recourse costs,\nand also ensures that the resulting recourse achieves a positive model\nprediction. We develop novel theoretical results to characterize the recourse\ninvalidation rates corresponding to any given instance w.r.t. different classes\nof underlying models (e.g., linear models, tree based models etc.), and\nleverage these results to efficiently optimize the proposed objective.\nExperimental evaluation with multiple real world datasets demonstrates the\nefficacy of the proposed framework.\n","authors":["Martin Pawelczyk","Teresa Datta","Johannes van-den-Heuvel","Gjergji Kasneci","Himabindu Lakkaraju"],"pdf_url":"https://arxiv.org/pdf/2203.06768v4.pdf","comment":"ICLR 2023, camera ready version"},{"id":"http://arxiv.org/abs/2310.07598v1","updated":"2023-10-11T15:38:53Z","published":"2023-10-11T15:38:53Z","title":"Survey on Imbalanced Data, Representation Learning and SEP Forecasting","summary":" Deep Learning methods have significantly advanced various data-driven tasks\nsuch as regression, classification, and forecasting. However, much of this\nprogress has been predicated on the strong but often unrealistic assumption\nthat training datasets are balanced with respect to the targets they contain.\nThis misalignment with real-world conditions, where data is frequently\nimbalanced, hampers the effectiveness of such models in practical applications.\nMethods that reconsider that assumption and tackle real-world imbalances have\nbegun to emerge and explore avenues to address this challenge. 
One such\npromising avenue is representation learning, which enables models to capture\ncomplex data characteristics and generalize better to minority classes. By\nfocusing on a richer representation of the feature space, these techniques hold\nthe potential to mitigate the impact of data imbalance. In this survey, we\npresent deep learning works that step away from the balanced-data assumption,\nemploying strategies like representation learning to better approximate\nreal-world imbalances. We also highlight a critical application in SEP\nforecasting where addressing data imbalance is paramount for success.\n","authors":["Josias Moukpe"],"pdf_url":"https://arxiv.org/pdf/2310.07598v1.pdf","comment":"Survey Paper, 4 figures, 16 pages"},{"id":"http://arxiv.org/abs/2310.07596v1","updated":"2023-10-11T15:37:31Z","published":"2023-10-11T15:37:31Z","title":"Prospective Side Information for Latent MDPs","summary":" In many interactive decision-making settings, there is latent and unobserved\ninformation that remains fixed. Consider, for example, a dialogue system, where\ncomplete information about a user, such as the user's preferences, is not\ngiven. In such an environment, the latent information remains fixed throughout\neach episode, since the identity of the user does not change during an\ninteraction. This type of environment can be modeled as a Latent Markov\nDecision Process (LMDP), a special instance of Partially Observed Markov\nDecision Processes (POMDPs). Previous work established exponential lower bounds\nin the number of latent contexts for the LMDP class. This puts forward a\nquestion: under which natural assumptions a near-optimal policy of an LMDP can\nbe efficiently learned? In this work, we study the class of LMDPs with {\\em\nprospective side information}, when an agent receives additional, weakly\nrevealing, information on the latent context at the beginning of each episode.\nWe show that, surprisingly, this problem is not captured by contemporary\nsettings and algorithms designed for partially observed environments. We then\nestablish that any sample efficient algorithm must suffer at least\n$\\Omega(K^{2/3})$-regret, as opposed to standard $\\Omega(\\sqrt{K})$ lower\nbounds, and design an algorithm with a matching upper bound.\n","authors":["Jeongyeol Kwon","Yonathan Efroni","Shie Mannor","Constantine Caramanis"],"pdf_url":"https://arxiv.org/pdf/2310.07596v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07592v1","updated":"2023-10-11T15:35:20Z","published":"2023-10-11T15:35:20Z","title":"Transformers for Green Semantic Communication: Less Energy, More\n Semantics","summary":" Semantic communication aims to transmit meaningful and effective information\nrather than focusing on individual symbols or bits, resulting in benefits like\nreduced latency, bandwidth usage, and higher throughput compared to traditional\ncommunication. However, semantic communication poses significant challenges due\nto the need for universal metrics for benchmarking the joint effects of\nsemantic information loss and practical energy consumption. This research\npresents a novel multi-objective loss function named \"Energy-Optimized Semantic\nLoss\" (EOSL), addressing the challenge of balancing semantic information loss\nand energy consumption. 
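The EOSL abstract above describes a multi-objective loss that balances semantic loss against energy consumption but does not state its functional form, so the helper below is only an assumed convex-combination placeholder, with invented candidate models and an arbitrary energy budget, meant to show how such a criterion could drive encoder selection.

def energy_optimized_semantic_loss(semantic_loss: float,
                                   energy_joules: float,
                                   energy_budget_joules: float,
                                   alpha: float = 0.5) -> float:
    """Assumed convex combination of semantic loss and normalized energy cost.

    The real EOSL formulation may differ; alpha and the normalization here are
    illustrative choices only.
    """
    energy_term = energy_joules / energy_budget_joules
    return alpha * semantic_loss + (1.0 - alpha) * energy_term

# Rank three hypothetical encoder models by the combined criterion.
candidates = {
    "tiny":  {"semantic_loss": 0.30, "energy_joules": 5.0},
    "base":  {"semantic_loss": 0.18, "energy_joules": 20.0},
    "large": {"semantic_loss": 0.15, "energy_joules": 80.0},
}
budget = 100.0
ranked = sorted(candidates.items(),
                key=lambda kv: energy_optimized_semantic_loss(
                    kv[1]["semantic_loss"], kv[1]["energy_joules"], budget))
print([name for name, _ in ranked])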
Through comprehensive experiments on transformer\nmodels, including CPU and GPU energy usage, it is demonstrated that EOSL-based\nencoder model selection can save up to 90\\% of energy while achieving a 44\\%\nimprovement in semantic similarity performance during inference in this\nexperiment. This work paves the way for energy-efficient neural network\nselection and the development of greener semantic communication architectures.\n","authors":["Shubhabrata Mukherjee","Cory Beard","Sejun Song"],"pdf_url":"https://arxiv.org/pdf/2310.07592v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.14137v3","updated":"2023-10-11T15:34:51Z","published":"2022-08-30T10:35:32Z","title":"On the Trade-Off between Actionable Explanations and the Right to be\n Forgotten","summary":" As machine learning (ML) models are increasingly being deployed in\nhigh-stakes applications, policymakers have suggested tighter data protection\nregulations (e.g., GDPR, CCPA). One key principle is the \"right to be\nforgotten\" which gives users the right to have their data deleted. Another key\nprinciple is the right to an actionable explanation, also known as algorithmic\nrecourse, allowing users to reverse unfavorable decisions. To date, it is\nunknown whether these two principles can be operationalized simultaneously.\nTherefore, we introduce and study the problem of recourse invalidation in the\ncontext of data deletion requests. More specifically, we theoretically and\nempirically analyze the behavior of popular state-of-the-art algorithms and\ndemonstrate that the recourses generated by these algorithms are likely to be\ninvalidated if a small number of data deletion requests (e.g., 1 or 2) warrant\nupdates of the predictive model. For the setting of differentiable models, we\nsuggest a framework to identify a minimal subset of critical training points\nwhich, when removed, maximize the fraction of invalidated recourses. Using our\nframework, we empirically show that the removal of as little as 2 data\ninstances from the training set can invalidate up to 95 percent of all\nrecourses output by popular state-of-the-art algorithms. Thus, our work raises\nfundamental questions about the compatibility of \"the right to an actionable\nexplanation\" in the context of the \"right to be forgotten\", while also\nproviding constructive insights on the determining factors of recourse\nrobustness.\n","authors":["Martin Pawelczyk","Tobias Leemann","Asia Biega","Gjergji Kasneci"],"pdf_url":"https://arxiv.org/pdf/2208.14137v3.pdf","comment":"ICLR 2023 camera ready version"},{"id":"http://arxiv.org/abs/2310.07588v1","updated":"2023-10-11T15:28:44Z","published":"2023-10-11T15:28:44Z","title":"Accurate Use of Label Dependency in Multi-Label Text Classification\n Through the Lens of Causality","summary":" Multi-Label Text Classification (MLTC) aims to assign the most relevant\nlabels to each given text. Existing methods demonstrate that label dependency\ncan help to improve the model's performance. However, the introduction of label\ndependency may cause the model to suffer from unwanted prediction bias. In this\nstudy, we attribute the bias to the model's misuse of label dependency, i.e.,\nthe model tends to utilize the correlation shortcut in label dependency rather\nthan fusing text information and label dependency for prediction. 
Motivated by\ncausal inference, we propose a CounterFactual Text Classifier (CFTC) to\neliminate the correlation bias, and make causality-based predictions.\nSpecifically, our CFTC first adopts the predict-then-modify backbone to extract\nprecise label information embedded in label dependency, then blocks the\ncorrelation shortcut through the counterfactual de-bias technique with the help\nof the human causal graph. Experimental results on three datasets demonstrate\nthat our CFTC significantly outperforms the baselines and effectively\neliminates the correlation bias in datasets.\n","authors":["Caoyun Fan","Wenqing Chen","Jidong Tian","Yitian Li","Hao He","Yaohui Jin"],"pdf_url":"https://arxiv.org/pdf/2310.07588v1.pdf","comment":"Applied Intelligence 2023"},{"id":"http://arxiv.org/abs/2310.07587v1","updated":"2023-10-11T15:28:39Z","published":"2023-10-11T15:28:39Z","title":"Fed-GraB: Federated Long-tailed Learning with Self-Adjusting Gradient\n Balancer","summary":" Data privacy and long-tailed distribution are the norms rather than the\nexception in many real-world tasks. This paper investigates a federated\nlong-tailed learning (Fed-LT) task in which each client holds a locally\nheterogeneous dataset; if the datasets can be globally aggregated, they jointly\nexhibit a long-tailed distribution. Under such a setting, existing federated\noptimization and/or centralized long-tailed learning methods hardly apply due\nto challenges in (a) characterizing the global long-tailed distribution under\nprivacy constraints and (b) adjusting the local learning strategy to cope with\nthe head-tail imbalance. In response, we propose a method termed\n$\\texttt{Fed-GraB}$, comprised of a Self-adjusting Gradient Balancer (SGB)\nmodule that re-weights clients' gradients in a closed-loop manner, based on the\nfeedback of global long-tailed distribution evaluated by a Direct Prior\nAnalyzer (DPA) module. Using $\\texttt{Fed-GraB}$, clients can effectively\nalleviate the distribution drift caused by data heterogeneity during the model\ntraining process and obtain a global model with better performance on the\nminority classes while maintaining the performance of the majority classes.\nExtensive experiments demonstrate that $\\texttt{Fed-GraB}$ achieves\nstate-of-the-art performance on representative datasets such as CIFAR-10-LT,\nCIFAR-100-LT, ImageNet-LT, and iNaturalist.\n","authors":["Zikai Xiao","Zihan Chen","Songshang Liu","Hualiang Wang","Yang Feng","Jin Hao","Joey Tianyi Zhou","Jian Wu","Howard Hao Yang","Zuozhu Liu"],"pdf_url":"https://arxiv.org/pdf/2310.07587v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.11942v2","updated":"2023-10-11T15:21:32Z","published":"2023-09-21T09:57:03Z","title":"On the Probability of Immunity","summary":" This work is devoted to the study of the probability of immunity, i.e. the\neffect occurs whether exposed or not. We derive necessary and sufficient\nconditions for non-immunity and $\\epsilon$-bounded immunity, i.e. the\nprobability of immunity is zero and $\\epsilon$-bounded, respectively. The\nformer allows us to estimate the probability of benefit (i.e., the effect\noccurs if and only if exposed) from a randomized controlled trial, and the\nlatter allows us to produce bounds of the probability of benefit that are\ntighter than the existing ones. 
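For concreteness, the two quantities discussed above can be written in standard potential-outcome notation (an assumption of this sketch, with $Y_1$ and $Y_0$ denoting the outcome under exposure and non-exposure), following the verbal definitions given in the abstract:

```latex
% Potential-outcome notation is assumed; Y_1 and Y_0 are the outcomes under
% exposure and non-exposure, respectively.
\begin{align*}
P(\text{immunity}) &= P(Y_1 = 1,\; Y_0 = 1) && \text{the effect occurs whether exposed or not,} \\
P(\text{benefit})  &= P(Y_1 = 1,\; Y_0 = 0) && \text{the effect occurs if and only if exposed.}
\end{align*}
```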
We also introduce the concept of indirect\nimmunity (i.e., through a mediator) and repeat our previous analysis for it.\nFinally, we propose a method for sensitivity analysis of the probability of\nimmunity under unmeasured confounding.\n","authors":["Jose M. Peña"],"pdf_url":"https://arxiv.org/pdf/2309.11942v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07582v1","updated":"2023-10-11T15:20:07Z","published":"2023-10-11T15:20:07Z","title":"Linear Latent World Models in Simple Transformers: A Case Study on\n Othello-GPT","summary":" Foundation models exhibit significant capabilities in decision-making and\nlogical deductions. Nonetheless, a continuing discourse persists regarding\ntheir genuine understanding of the world as opposed to mere stochastic mimicry.\nThis paper meticulously examines a simple transformer trained for Othello,\nextending prior research to enhance comprehension of the emergent world model\nof Othello-GPT. The investigation reveals that Othello-GPT encapsulates a\nlinear representation of opposing pieces, a factor that causally steers its\ndecision-making process. This paper further elucidates the interplay between\nthe linear world representation and causal decision-making, and their\ndependence on layer depth and model complexity. We have made the code public.\n","authors":["Dean S. Hazineh","Zechen Zhang","Jeffery Chiu"],"pdf_url":"https://arxiv.org/pdf/2310.07582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07579v1","updated":"2023-10-11T15:19:31Z","published":"2023-10-11T15:19:31Z","title":"In-Context Unlearning: Language Models as Few Shot Unlearners","summary":" Machine unlearning, the study of efficiently removing the impact of specific\ntraining points on the trained model, has garnered increased attention of late,\ndriven by the need to comply with privacy regulations like the \\emph{Right to\nbe Forgotten}. Although unlearning is particularly relevant for LLMs in light\nof the copyright issues they raise, achieving precise unlearning is\ncomputationally infeasible for very large models. To this end, recent work has\nproposed several algorithms which approximate the removal of training data\nwithout retraining the model. These algorithms crucially rely on access to the\nmodel parameters in order to update them, an assumption that may not hold in\npractice due to computational constraints or when the LLM is accessed via API.\nIn this work, we propose a new class of unlearning methods for LLMs we call\n``In-Context Unlearning'', providing inputs in context and without having to\nupdate model parameters. To unlearn a particular training instance, we provide\nthe instance alongside a flipped label and additional correctly labelled\ninstances which are prepended as inputs to the LLM at inference time. Our\nexperimental results demonstrate that these contexts effectively remove\nspecific information from the training set while maintaining performance levels\nthat are competitive with (or in some cases exceed) state-of-the-art unlearning\nmethods that require access to the LLM parameters.\n","authors":["Martin Pawelczyk","Seth Neel","Himabindu Lakkaraju"],"pdf_url":"https://arxiv.org/pdf/2310.07579v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07576v1","updated":"2023-10-11T15:17:55Z","published":"2023-10-11T15:17:55Z","title":"Analyzing Trendy Twitter Hashtags in the 2022 French Election","summary":" Regressions trained to predict the future activity of social media users need\nrich features for accurate predictions. 
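Returning briefly to the in-context unlearning recipe summarized above (the instance to forget is shown with a flipped label, alongside correctly labelled examples, all prepended to the query at inference time), a minimal sketch of the prompt construction might look as follows. The template, task, and labels are assumptions for illustration, not the authors' exact format.

```python
# Minimal sketch of in-context unlearning prompt construction: the instance to
# "forget" appears with a flipped label, followed by correctly labelled
# examples, all prepended to the query at inference time. The template below
# is an assumption for illustration only.

def flip_label(label: str) -> str:
    return "negative" if label == "positive" else "positive"

def build_unlearning_prompt(forget_example, context_examples, query_text):
    """forget_example and context_examples are (text, label) pairs."""
    lines = []
    text, label = forget_example
    lines.append(f"Review: {text}\nSentiment: {flip_label(label)}")  # flipped label
    for text, label in context_examples:
        lines.append(f"Review: {text}\nSentiment: {label}")          # correct labels
    lines.append(f"Review: {query_text}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_unlearning_prompt(
    forget_example=("The movie was wonderful.", "positive"),
    context_examples=[("Terrible pacing and acting.", "negative"),
                      ("A delightful surprise.", "positive")],
    query_text="I would watch it again.",
)
print(prompt)
```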
Many advanced models exist to generate\nsuch features; however, the time complexities of their computations are often\nprohibitive when they run on enormous data-sets. Some studies have shown that\nsimple semantic network features can be rich enough to use for regressions\nwithout requiring complex computations. We propose a method for using semantic\nnetworks as user-level features for machine learning tasks. We conducted an\nexperiment using a semantic network of 1037 Twitter hashtags from a corpus of\n3.7 million tweets related to the 2022 French presidential election. A\nbipartite graph is formed where hashtags are nodes and weighted edges connect\nthe hashtags reflecting the number of Twitter users that interacted with both\nhashtags. The graph is then transformed into a maximum-spanning tree with the\nmost popular hashtag as its root node to construct a hierarchy amongst the\nhashtags. We then provide a vector feature for each user based on this tree. To\nvalidate the usefulness of our semantic feature we performed a regression\nexperiment to predict the response rate of each user with six emotions like\nanger, enjoyment, or disgust. Our semantic feature performs well with the\nregression with most emotions having $R^2$ above 0.5. These results suggest\nthat our semantic feature could be considered for use in further experiments\npredicting social media response on big data-sets.\n","authors":["Aamir Mandviwalla","Lake Yin","Boleslaw K. Szymanski"],"pdf_url":"https://arxiv.org/pdf/2310.07576v1.pdf","comment":"9 pages, 1 figure, to be published in Complex Networks 2023"},{"id":"http://arxiv.org/abs/2211.15751v3","updated":"2023-10-11T15:13:02Z","published":"2022-11-28T20:11:37Z","title":"Edge Video Analytics: A Survey on Applications, Systems and Enabling\n Techniques","summary":" Video, as a key driver in the global explosion of digital information, can\ncreate tremendous benefits for human society. Governments and enterprises are\ndeploying innumerable cameras for a variety of applications, e.g., law\nenforcement, emergency management, traffic control, and security surveillance,\nall facilitated by video analytics (VA). This trend is spurred by the rapid\nadvancement of deep learning (DL), which enables more precise models for object\nclassification, detection, and tracking. Meanwhile, with the proliferation of\nInternet-connected devices, massive amounts of data are generated daily,\noverwhelming the cloud. Edge computing, an emerging paradigm that moves\nworkloads and services from the network core to the network edge, has been\nwidely recognized as a promising solution. The resulting new intersection, edge\nvideo analytics (EVA), begins to attract widespread attention. Nevertheless,\nonly a few loosely-related surveys exist on this topic. The basic concepts of\nEVA (e.g., definition, architectures) were not fully elucidated due to the\nrapid development of this domain. To fill these gaps, we provide a\ncomprehensive survey of the recent efforts on EVA. In this paper, we first\nreview the fundamentals of edge computing, followed by an overview of VA. EVA\nsystems and their enabling techniques are discussed next. In addition, we\nintroduce prevalent frameworks and datasets to aid future researchers in the\ndevelopment of EVA systems. Finally, we discuss existing challenges and foresee\nfuture research directions. 
We believe this survey will help readers comprehend\nthe relationship between VA and edge computing, and spark new ideas on EVA.\n","authors":["Renjie Xu","Saiedeh Razavi","Rong Zheng"],"pdf_url":"https://arxiv.org/pdf/2211.15751v3.pdf","comment":"Accepted in IEEE Communications Surveys and Tutorials, 2023"},{"id":"http://arxiv.org/abs/2304.06548v2","updated":"2023-10-11T15:09:11Z","published":"2023-04-13T13:57:33Z","title":"Multi-kernel Correntropy-based Orientation Estimation of IMUs: Gradient\n Descent Methods","summary":" This paper presents two computationally efficient algorithms for the\norientation estimation of inertial measurement units (IMUs): the\ncorrentropy-based gradient descent (CGD) and the correntropy-based decoupled\norientation estimation (CDOE). Traditional methods, such as gradient descent\n(GD) and decoupled orientation estimation (DOE), rely on the mean squared error\n(MSE) criterion, making them vulnerable to external acceleration and magnetic\ninterference. To address this issue, we demonstrate that the multi-kernel\ncorrentropy loss (MKCL) is an optimal objective function for maximum likelihood\nestimation (MLE) when the noise follows a type of heavy-tailed distribution. In\ncertain situations, the estimation error of the MKCL is bounded even in the\npresence of arbitrarily large outliers. By replacing the standard MSE cost\nfunction with MKCL, we develop the CGD and CDOE algorithms. We evaluate the\neffectiveness of our proposed methods by comparing them with existing\nalgorithms in various situations. Experimental results indicate that our\nproposed methods (CGD and CDOE) outperform their conventional counterparts (GD\nand DOE), especially when faced with external acceleration and magnetic\ndisturbances. Furthermore, the new algorithms demonstrate significantly lower\ncomputational complexity than Kalman filter-based approaches, making them\nsuitable for applications with low-cost microprocessors.\n","authors":["Shilei Li","Lijing Li","Dawei Shi","Yunjiang Lou","Ling Shi"],"pdf_url":"https://arxiv.org/pdf/2304.06548v2.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2304.14933v2","updated":"2023-10-11T15:08:51Z","published":"2023-04-28T15:43:21Z","title":"An Empirical Study of Multimodal Model Merging","summary":" Model merging (e.g., via interpolation or task arithmetic) fuses multiple\nmodels trained on different tasks to generate a multi-task solution. The\ntechnique has been proven successful in previous studies, where the models are\ntrained on similar tasks and with the same initialization. In this paper, we\nexpand on this concept to a multimodal setup by merging transformers trained on\ndifferent modalities. Furthermore, we conduct our study for a novel goal where\nwe can merge vision, language, and cross-modal transformers of a\nmodality-specific architecture to create a parameter-efficient\nmodality-agnostic architecture. Through comprehensive experiments, we\nsystematically investigate the key factors impacting model performance after\nmerging, including initialization, merging mechanisms, and model architectures.\nWe also propose two metrics that assess the distance between weights to be\nmerged and can serve as an indicator of the merging outcomes. 
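To make the interpolation-style merging mentioned above concrete, the sketch below linearly interpolates the parameters of two checkpoints weight-by-weight. The mixing coefficient, the assumption of identical architectures, and the toy modules are choices of this sketch rather than the paper's recipe.

```python
# Minimal sketch of interpolation-style model merging: parameters of two
# checkpoints sharing an architecture are linearly interpolated per tensor.
# The mixing coefficient and toy modules are assumptions of this sketch.
from typing import Dict
import torch

def interpolate_state_dicts(sd_a: Dict[str, torch.Tensor],
                            sd_b: Dict[str, torch.Tensor],
                            alpha: float = 0.5) -> Dict[str, torch.Tensor]:
    """Return alpha * sd_a + (1 - alpha) * sd_b for matching parameter names."""
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share the same architecture"
    return {name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name]
            for name in sd_a}

# Example with two tiny modules of the same shape.
model_a = torch.nn.Linear(4, 2)
model_b = torch.nn.Linear(4, 2)
merged = torch.nn.Linear(4, 2)
merged.load_state_dict(interpolate_state_dicts(model_a.state_dict(),
                                               model_b.state_dict(),
                                               alpha=0.5))
```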
Our analysis\nleads to an effective training recipe for matching the performance of the\nmodality-agnostic baseline (i.e., pre-trained from scratch) via model merging.\nOur method also outperforms naive merging significantly on various tasks, with\nimprovements of 3% on VQA, 7% on COCO retrieval, 25% on NLVR2, 14% on Flickr30k\nand 3% on ADE20k. Our code is available at https://github.com/ylsung/vl-merging\n","authors":["Yi-Lin Sung","Linjie Li","Kevin Lin","Zhe Gan","Mohit Bansal","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2304.14933v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.05469v2","updated":"2023-10-11T15:07:36Z","published":"2023-10-09T07:26:35Z","title":"Vibroacoustic Frequency Response Prediction with Query-based Operator\n Networks","summary":" Understanding vibroacoustic wave propagation in mechanical structures like\nairplanes, cars and houses is crucial to ensure health and comfort of their\nusers. To analyze such systems, designers and engineers primarily consider the\ndynamic response in the frequency domain, which is computed through expensive\nnumerical simulations like the finite element method. In contrast, data-driven\nsurrogate models offer the promise of speeding up these simulations, thereby\nfacilitating tasks like design optimization, uncertainty quantification, and\ndesign space exploration. We present a structured benchmark for a\nrepresentative vibroacoustic problem: Predicting the frequency response for\nvibrating plates with varying forms of beadings. The benchmark features a total\nof 12,000 plate geometries with an associated numerical solution and introduces\nevaluation metrics to quantify the prediction quality. To address the frequency\nresponse prediction task, we propose a novel frequency query operator model,\nwhich is trained to map plate geometries to frequency response functions. By\nintegrating principles from operator learning and implicit models for shape\nencoding, our approach effectively addresses the prediction of resonance peaks\nof frequency responses. We evaluate the method on our vibrating-plates\nbenchmark and find that it outperforms DeepONets, Fourier Neural Operators and\nmore traditional neural network architectures. The code and dataset are\navailable from https://eckerlab.org/code/delden2023_plate.\n","authors":["Jan van Delden","Julius Schultz","Christopher Blech","Sabine C. Langer","Timo Lüddecke"],"pdf_url":"https://arxiv.org/pdf/2310.05469v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07560v1","updated":"2023-10-11T15:04:33Z","published":"2023-10-11T15:04:33Z","title":"ROMO: Retrieval-enhanced Offline Model-based Optimization","summary":" Data-driven black-box model-based optimization (MBO) problems arise in a\ngreat number of practical application scenarios, where the goal is to find a\ndesign over the whole space maximizing a black-box target function based on a\nstatic offline dataset. In this work, we consider a more general but\nchallenging MBO setting, named constrained MBO (CoMBO), where only part of the\ndesign space can be optimized while the rest is constrained by the environment.\nA new challenge arising from CoMBO is that most observed designs that satisfy\nthe constraints are mediocre in evaluation. Therefore, we focus on optimizing\nthese mediocre designs in the offline dataset while maintaining the given\nconstraints rather than further boosting the best observed design in the\ntraditional MBO setting. 
We propose retrieval-enhanced offline model-based\noptimization (ROMO), a new derivable forward approach that retrieves the\noffline dataset and aggregates relevant samples to provide a trusted\nprediction, and use it for gradient-based optimization. ROMO is simple to\nimplement and outperforms state-of-the-art approaches in the CoMBO setting.\nEmpirically, we conduct experiments on a synthetic Hartmann (3D) function\ndataset, an industrial CIO dataset, and a suite of modified tasks in the\nDesign-Bench benchmark. Results show that ROMO performs well in a wide range of\nconstrained optimization tasks.\n","authors":["Mingcheng Chen","Haoran Zhao","Yuxiang Zhao","Hulei Fan","Hongqiao Gao","Yong Yu","Zheng Tian"],"pdf_url":"https://arxiv.org/pdf/2310.07560v1.pdf","comment":"15 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.07558v1","updated":"2023-10-11T15:02:13Z","published":"2023-10-11T15:02:13Z","title":"Smootheness-Adaptive Dynamic Pricing with Nonparametric Demand Learning","summary":" We study the dynamic pricing problem where the demand function is\nnonparametric and H\\\"older smooth, and we focus on adaptivity to the unknown\nH\\\"older smoothness parameter $\\beta$ of the demand function. Traditionally the\noptimal dynamic pricing algorithm heavily relies on the knowledge of $\\beta$ to\nachieve a minimax optimal regret of\n$\\widetilde{O}(T^{\\frac{\\beta+1}{2\\beta+1}})$. However, we highlight the\nchallenge of adaptivity in this dynamic pricing problem by proving that no\npricing policy can adaptively achieve this minimax optimal regret without\nknowledge of $\\beta$. Motivated by the impossibility result, we propose a\nself-similarity condition to enable adaptivity. Importantly, we show that the\nself-similarity condition does not compromise the problem's inherent complexity\nsince it preserves the regret lower bound\n$\\Omega(T^{\\frac{\\beta+1}{2\\beta+1}})$. Furthermore, we develop a\nsmoothness-adaptive dynamic pricing algorithm and theoretically prove that the\nalgorithm achieves this minimax optimal regret bound without the prior\nknowledge $\\beta$.\n","authors":["Zeqi Ye","Hansheng Jiang"],"pdf_url":"https://arxiv.org/pdf/2310.07558v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12898v3","updated":"2023-10-11T14:59:49Z","published":"2023-06-22T14:07:23Z","title":"Machine-Learning-Assisted and Real-Time-Feedback-Controlled Growth of\n InAs/GaAs Quantum Dots","summary":" Self-assembled InAs/GaAs quantum dots (QDs) have properties highly valuable\nfor developing various optoelectronic devices such as QD lasers and single\nphoton sources. The applications strongly rely on the density and quality of\nthese dots, which has motivated studies of the growth process control to\nrealize high-quality epi-wafers and devices. Establishing the process\nparameters in molecular beam epitaxy (MBE) for a specific density of QDs is a\nmultidimensional optimization challenge, usually addressed through\ntime-consuming and iterative trial-and-error. Here, we report a real-time\nfeedback control method to realize the growth of QDs with arbitrary density,\nwhich is fully automated and intelligent. We developed a machine learning (ML)\nmodel named 3D ResNet 50 trained using reflection high-energy electron\ndiffraction (RHEED) videos as input instead of static images and providing\nreal-time feedback on surface morphologies for process control. 
As a result, we\ndemonstrated that ML from previous growth could predict the post-growth density\nof QDs, by successfully tuning the QD densities in near-real time from 1.5E10\ncm-2 down to 3.8E8 cm-2 or up to 1.4E11 cm-2. Compared to traditional methods,\nour approach, with in situ tuning capabilities and excellent reliability, can\ndramatically expedite the material optimization process and improve the\nreproducibility of MBE, constituting significant progress for thin film growth\ntechniques. The concepts and methodologies proven feasible in this work are\npromising for application to a variety of material growth processes, which will\nrevolutionize semiconductor manufacturing for optoelectronic and\nmicroelectronic industries.\n","authors":["Chao Shen","Wenkang Zhan","Kaiyao Xin","Manyang Li","Zhenyu Sun","Hui Cong","Chi Xu","Jian Tang","Zhaofeng Wu","Bo Xu","Zhongming Wei","Chunlai Xue","Chao Zhao","Zhanguo Wang"],"pdf_url":"https://arxiv.org/pdf/2306.12898v3.pdf","comment":"5 figures"},{"id":"http://arxiv.org/abs/2308.12067v2","updated":"2023-10-11T14:49:26Z","published":"2023-08-23T11:27:30Z","title":"InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4","summary":" Multimodal large language models are typically trained in two stages: first\npre-training on image-text pairs, and then fine-tuning using supervised\nvision-language instruction data. Recent studies have shown that large language\nmodels can achieve satisfactory results even with a limited amount of\nhigh-quality instruction-following data. In this paper, we introduce\nInstructionGPT-4, which is fine-tuned on a small dataset comprising only 200\nexamples, amounting to approximately 6\% of the instruction-following data used\nin the alignment dataset for MiniGPT-4. To achieve this, we first propose\nseveral metrics to assess the quality of multimodal instruction data. Based on\nthese metrics, we present an effective and trainable data selector to\nautomatically identify and filter low-quality vision-language data. By\nemploying this method, InstructionGPT-4 outperforms the original MiniGPT-4 on\nvarious evaluations. Overall, our findings demonstrate that less but\nhigh-quality instruction tuning data is effective in enabling multimodal large\nlanguage models to generate better output. Our code is available at\nhttps://github.com/waltonfuture/InstructionGPT-4.\n","authors":["Lai Wei","Zihao Jiang","Weiran Huang","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2308.12067v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07535v1","updated":"2023-10-11T14:39:51Z","published":"2023-10-11T14:39:51Z","title":"Improving Fairness-Accuracy tradeoff with few Test Samples under\n Covariate Shift","summary":" Covariate shift in the test data can significantly downgrade both the\naccuracy and the fairness performance of the model. Ensuring fairness across\ndifferent sensitive groups in such settings is of paramount importance due to\nsocietal implications like criminal justice. We operate under the unsupervised\nregime where only a small set of unlabeled test samples along with a labeled\ntraining set is available. Towards this problem, we make three contributions.\nFirst is a novel composite weighted entropy based objective for prediction\naccuracy which is optimized along with a representation matching loss for\nfairness.
We experimentally verify that optimizing with our loss formulation\noutperforms a number of state-of-the-art baselines in the pareto sense with\nrespect to the fairness-accuracy tradeoff on several standard datasets. Our\nsecond contribution is a new setting we term Asymmetric Covariate Shift that,\nto the best of our knowledge, has not been studied before. Asymmetric covariate\nshift occurs when distribution of covariates of one group shifts significantly\ncompared to the other groups and this happens when a dominant group is\nover-represented. While this setting is extremely challenging for current\nbaselines, We show that our proposed method significantly outperforms them. Our\nthird contribution is theoretical, where we show that our weighted entropy term\nalong with prediction loss on the training set approximates test loss under\ncovariate shift. Empirically and through formal sample complexity bounds, we\nshow that this approximation to the unseen test loss does not depend on\nimportance sampling variance which affects many other baselines.\n","authors":["Shreyas Havaldar","Jatin Chauhan","Karthikeyan Shanmugam","Jay Nandy","Aravindan Raghuveer"],"pdf_url":"https://arxiv.org/pdf/2310.07535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07534v1","updated":"2023-10-11T14:39:12Z","published":"2023-10-11T14:39:12Z","title":"Human-Centered Evaluation of XAI Methods","summary":" In the ever-evolving field of Artificial Intelligence, a critical challenge\nhas been to decipher the decision-making processes within the so-called \"black\nboxes\" in deep learning. Over recent years, a plethora of methods have emerged,\ndedicated to explaining decisions across diverse tasks. Particularly in tasks\nlike image classification, these methods typically identify and emphasize the\npivotal pixels that most influence a classifier's prediction. Interestingly,\nthis approach mirrors human behavior: when asked to explain our rationale for\nclassifying an image, we often point to the most salient features or aspects.\nCapitalizing on this parallel, our research embarked on a user-centric study.\nWe sought to objectively measure the interpretability of three leading\nexplanation methods: (1) Prototypical Part Network, (2) Occlusion, and (3)\nLayer-wise Relevance Propagation. Intriguingly, our results highlight that\nwhile the regions spotlighted by these methods can vary widely, they all offer\nhumans a nearly equivalent depth of understanding. This enables users to\ndiscern and categorize images efficiently, reinforcing the value of these\nmethods in enhancing AI transparency.\n","authors":["Karam Dawoud","Wojciech Samek","Sebastian Lapuschkin","Sebastian Bosse"],"pdf_url":"https://arxiv.org/pdf/2310.07534v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13149v3","updated":"2023-10-11T14:32:34Z","published":"2023-07-24T22:22:32Z","title":"Discovering interpretable elastoplasticity models via the neural\n polynomial method enabled symbolic regressions","summary":" Conventional neural network elastoplasticity models are often perceived as\nlacking interpretability. This paper introduces a two-step machine learning\napproach that returns mathematical models interpretable by human experts. In\nparticular, we introduce a surrogate model where yield surfaces are expressed\nin terms of a set of single-variable feature mappings obtained from supervised\nlearning. 
A postprocessing step is then used to re-interpret the set of\nsingle-variable neural network mapping functions into mathematical form through\nsymbolic regression. This divide-and-conquer approach provides several\nimportant advantages. First, it enables us to overcome the scaling issue of\nsymbolic regression algorithms. From a practical perspective, it enhances the\nportability of learned models for partial differential equation solvers written\nin different programming languages. Finally, it enables us to have a concrete\nunderstanding of the attributes of the materials, such as convexity and\nsymmetries of models, through automated derivations and reasoning. Numerical\nexamples have been provided, along with an open-source code to enable third\nparty validation.\n","authors":["Bahador Bahmani","Hyoung Suk Suh","WaiChing Sun"],"pdf_url":"https://arxiv.org/pdf/2307.13149v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07528v1","updated":"2023-10-11T14:29:11Z","published":"2023-10-11T14:29:11Z","title":"Provable Advantage of Parameterized Quantum Circuit in Function\n Approximation","summary":" Understanding the power of parameterized quantum circuits (PQCs) in\naccomplishing machine learning tasks is one of the most important questions in\nquantum machine learning. In this paper, we analyze the expressivity of PQCs\nthrough the lens of function approximation. Previously established universal\napproximation theorems for PQCs are mainly nonconstructive, leading us to the\nfollowing question: How large do the PQCs need to be to approximate the target\nfunction up to a given error? We exhibit explicit constructions of data\nre-uploading PQCs for approximating continuous and smooth functions and\nestablish quantitative approximation error bounds in terms of the width, the\ndepth and the number of trainable parameters of the PQCs. To achieve this, we\nutilize techniques from quantum signal processing and linear combinations of\nunitaries to construct PQCs that implement multivariate polynomials. We\nimplement global and local approximation techniques using Bernstein polynomials\nand local Taylor expansion and analyze their performances in the quantum\nsetting. We also compare our proposed PQCs to nearly optimal deep neural\nnetworks in approximating high-dimensional smooth functions, showing that the\nratio between model sizes of PQC and deep neural networks is exponentially\nsmall with respect to the input dimension. This suggests a potentially novel\navenue for showcasing quantum advantages in quantum machine learning.\n","authors":["Zhan Yu","Qiuhao Chen","Yuling Jiao","Yinan Li","Xiliang Lu","Xin Wang","Jerry Zhijian Yang"],"pdf_url":"https://arxiv.org/pdf/2310.07528v1.pdf","comment":"31pages, 3 figures"},{"id":"http://arxiv.org/abs/2304.01950v2","updated":"2023-10-11T14:21:29Z","published":"2023-04-01T09:16:40Z","title":"MP-FedCL: Multiprototype Federated Contrastive Learning for Edge\n Intelligence","summary":" Federated learning-assisted edge intelligence enables privacy protection in\nmodern intelligent services. However, not independent and identically\ndistributed (non-IID) distribution among edge clients can impair the local\nmodel performance. The existing single prototype-based strategy represents a\nclass by using the mean of the feature space. 
However, feature spaces are\nusually not clustered, and a single prototype may not represent a class well.\nMotivated by this, this paper proposes a multi-prototype federated contrastive\nlearning approach (MP-FedCL) which demonstrates the effectiveness of using a\nmulti-prototype strategy over a single-prototype under non-IID settings,\nincluding both label and feature skewness. Specifically, a multi-prototype\ncomputation strategy based on \\textit{k-means} is first proposed to capture\ndifferent embedding representations for each class space, using multiple\nprototypes ($k$ centroids) to represent a class in the embedding space. In each\nglobal round, the computed multiple prototypes and their respective model\nparameters are sent to the edge server for aggregation into a global prototype\npool, which is then sent back to all clients to guide their local training.\nFinally, local training for each client minimizes their own supervised learning\ntasks and learns from shared prototypes in the global prototype pool through\nsupervised contrastive learning, which encourages them to learn knowledge\nrelated to their own class from others and reduces the absorption of unrelated\nknowledge in each global iteration. Experimental results on MNIST, Digit-5,\nOffice-10, and DomainNet show that our method outperforms multiple baselines,\nwith an average test accuracy improvement of about 4.6\\% and 10.4\\% under\nfeature and label non-IID distributions, respectively.\n","authors":["Yu Qiao","Md. Shirajum Munir","Apurba Adhikary","Huy Q. Le","Avi Deb Raha","Chaoning Zhang","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2304.01950v2.pdf","comment":"Accepted by IEEE Internet of Things"},{"id":"http://arxiv.org/abs/2111.08117v3","updated":"2023-10-11T14:20:40Z","published":"2021-11-15T22:33:52Z","title":"Neural networks with linear threshold activations: structure and\n algorithms","summary":" In this article we present new results on neural networks with linear\nthreshold activation functions. We precisely characterize the class of\nfunctions that are representable by such neural networks and show that 2 hidden\nlayers are necessary and sufficient to represent any function representable in\nthe class. This is a surprising result in the light of recent exact\nrepresentability investigations for neural networks using other popular\nactivation functions like rectified linear units (ReLU). We also give precise\nbounds on the sizes of the neural networks required to represent any function\nin the class. Finally, we design an algorithm to solve the empirical risk\nminimization (ERM) problem to global optimality for these neural networks with\na fixed architecture. The algorithm's running time is polynomial in the size of\nthe data sample, if the input dimension and the size of the network\narchitecture are considered fixed constants. The algorithm is unique in the\nsense that it works for any architecture with any number of layers, whereas\nprevious polynomial time globally optimal algorithms work only for very\nrestricted classes of architectures. Using these insights, we propose a new\nclass of neural networks that we call shortcut linear threshold networks. To\nthe best of our knowledge, this way of designing neural networks has not been\nexplored before in the literature. 
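Since the abstract above concerns networks built from linear threshold activations, a minimal sketch of such a unit may help fix ideas: the unit outputs 1 exactly when its affine pre-activation is positive. The layer sizes and the random example input are arbitrary choices for illustration, not anything prescribed by the paper.

```python
# Minimal sketch of a linear threshold unit/layer: output 1[Wx + b > 0]
# elementwise. Two hidden layers are composed here purely as an example.
import numpy as np

def linear_threshold_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a layer of linear threshold units: indicator of Wx + b > 0."""
    return (W @ x + b > 0).astype(np.float64)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first hidden layer
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second hidden layer
h = linear_threshold_layer(linear_threshold_layer(x, W1, b1), W2, b2)
print(h)   # binary outputs of a two-hidden-layer threshold network
```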
We show that these neural networks have\nseveral desirable theoretical properties.\n","authors":["Sammy Khalife","Hongyu Cheng","Amitabh Basu"],"pdf_url":"https://arxiv.org/pdf/2111.08117v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07518v1","updated":"2023-10-11T14:16:04Z","published":"2023-10-11T14:16:04Z","title":"Exploiting Causal Graph Priors with Posterior Sampling for Reinforcement\n Learning","summary":" Posterior sampling allows the exploitation of prior knowledge of the\nenvironment's transition dynamics to improve the sample efficiency of\nreinforcement learning. The prior is typically specified as a class of\nparametric distributions, a task that can be cumbersome in practice, often\nresulting in the choice of uninformative priors. In this work, we propose a\nnovel posterior sampling approach in which the prior is given as a (partial)\ncausal graph over the environment's variables. The latter is often more natural\nto design, such as listing known causal dependencies between biometric features\nin a medical treatment study. Specifically, we propose a hierarchical Bayesian\nprocedure, called C-PSRL, simultaneously learning the full causal graph at the\nhigher level and the parameters of the resulting factored dynamics at the lower\nlevel. For this procedure, we provide an analysis of its Bayesian regret, which\nexplicitly connects the regret rate with the degree of prior knowledge. Our\nnumerical evaluation conducted in illustrative domains confirms that C-PSRL\nstrongly improves the efficiency of posterior sampling with an uninformative\nprior while performing close to posterior sampling with the full causal graph.\n","authors":["Mirco Mutti","Riccardo De Santi","Marcello Restelli","Alexander Marx","Giorgia Ramponi"],"pdf_url":"https://arxiv.org/pdf/2310.07518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07511v1","updated":"2023-10-11T14:07:05Z","published":"2023-10-11T14:07:05Z","title":"A Unified Remote Sensing Anomaly Detector Across Modalities and Scenes\n via Deviation Relationship Learning","summary":" Remote sensing anomaly detector can find the objects deviating from the\nbackground as potential targets. Given the diversity in earth anomaly types, a\nunified anomaly detector across modalities and scenes should be cost-effective\nand flexible to new earth observation sources and anomaly types. However, the\ncurrent anomaly detectors are limited to a single modality and single scene,\nsince they aim to learn the varying background distribution. Motivated by the\nuniversal anomaly deviation pattern, in that anomalies exhibit deviations from\ntheir local context, we exploit this characteristic to build a unified anomaly\ndetector. Firstly, we reformulate the anomaly detection task as an undirected\nbilayer graph based on the deviation relationship, where the anomaly score is\nmodeled as the conditional probability, given the pattern of the background and\nnormal objects. The learning objective is then expressed as a conditional\nprobability ranking problem. Furthermore, we design an instantiation of the\nreformulation in the data, architecture, and optimization aspects. Simulated\nspectral and spatial anomalies drive the instantiated architecture. The model\nis optimized directly for the conditional probability ranking. 
The proposed\nmodel was validated in five modalities including the hyperspectral, visible\nlight, synthetic aperture radar (SAR), infrared and low light to show its\nunified detection ability.\n","authors":["Jingtao Li","Xinyu Wang","Hengwei Zhao","Liangpei Zhang","Yanfei Zhong"],"pdf_url":"https://arxiv.org/pdf/2310.07511v1.pdf","comment":"Journal paper"},{"id":"http://arxiv.org/abs/2310.07506v1","updated":"2023-10-11T14:02:11Z","published":"2023-10-11T14:02:11Z","title":"Leveraging Hierarchical Feature Sharing for Efficient Dataset\n Condensation","summary":" Given a real-world dataset, data condensation (DC) aims to synthesize a\nsignificantly smaller dataset that captures the knowledge of this dataset for\nmodel training with high performance. Recent works propose to enhance DC with\ndata parameterization, which condenses data into parameterized data containers\nrather than pixel space. The intuition behind data parameterization is to\nencode shared features of images to avoid additional storage costs. In this\npaper, we recognize that images share common features in a hierarchical way due\nto the inherent hierarchical structure of the classification system, which is\noverlooked by current data parameterization methods. To better align DC with\nthis hierarchical nature and encourage more efficient information sharing\ninside data containers, we propose a novel data parameterization architecture,\nHierarchical Memory Network (HMN). HMN stores condensed data in a three-tier\nstructure, representing the dataset-level, class-level, and instance-level\nfeatures. Another helpful property of the hierarchical architecture is that HMN\nnaturally ensures good independence among images despite achieving information\nsharing. This enables instance-level pruning for HMN to reduce redundant\ninformation, thereby further minimizing redundancy and enhancing performance.\nWe evaluate HMN on four public datasets (SVHN, CIFAR10, CIFAR100, and\nTiny-ImageNet) and compare HMN with eight DC baselines. The evaluation results\nshow that our proposed method outperforms all baselines, even when trained with\na batch-based loss consuming less GPU memory.\n","authors":["Haizhong Zheng","Jiachen Sun","Shutong Wu","Bhavya Kailkhura","Zhuoqing Mao","Chaowei Xiao","Atul Prakash"],"pdf_url":"https://arxiv.org/pdf/2310.07506v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07497v1","updated":"2023-10-11T13:50:28Z","published":"2023-10-11T13:50:28Z","title":"Sample-Driven Federated Learning for Energy-Efficient and Real-Time IoT\n Sensing","summary":" In the domain of Federated Learning (FL) systems, recent cutting-edge methods\nheavily rely on ideal conditions convergence analysis. Specifically, these\napproaches assume that the training datasets on IoT devices possess similar\nattributes to the global data distribution. However, this approach fails to\ncapture the full spectrum of data characteristics in real-time sensing FL\nsystems. In order to overcome this limitation, we suggest a new approach system\nspecifically designed for IoT networks with real-time sensing capabilities. Our\napproach takes into account the generalization gap due to the user's data\nsampling process. By effectively controlling this sampling process, we can\nmitigate the overfitting issue and improve overall accuracy. In particular, We\nfirst formulate an optimization problem that harnesses the sampling process to\nconcurrently reduce overfitting while maximizing accuracy. 
In pursuit of this\nobjective, our surrogate optimization problem is adept at handling energy\nefficiency while optimizing the accuracy with high generalization. To solve the\noptimization problem with high complexity, we introduce an online reinforcement\nlearning algorithm, named Sample-driven Control for Federated Learning (SCFL)\nbuilt on the Soft Actor-Critic (A2C) framework. This enables the agent to\ndynamically adapt and find the global optima even in changing environments. By\nleveraging the capabilities of SCFL, our system offers a promising solution for\nresource allocation in FL systems with real-time sensing capabilities.\n","authors":["Minh Ngoc Luu","Minh-Duong Nguyen","Ebrahim Bedeer","Van Duc Nguyen","Dinh Thai Hoang","Diep N. Nguyen","Quoc-Viet Pham"],"pdf_url":"https://arxiv.org/pdf/2310.07497v1.pdf","comment":"17 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.07491v1","updated":"2023-10-11T13:39:04Z","published":"2023-10-11T13:39:04Z","title":"Model-based Clustering of Individuals' Ecological Momentary Assessment\n Time-series Data for Improving Forecasting Performance","summary":" Through Ecological Momentary Assessment (EMA) studies, a number of\ntime-series data is collected across multiple individuals, continuously\nmonitoring various items of emotional behavior. Such complex data is commonly\nanalyzed at an individual level, using personalized models. However, it is\nbelieved that additional information of similar individuals is likely to\nenhance these models, leading to a better description of individuals. Thus,\nclustering is investigated with an aim to group together the most similar\nindividuals, and subsequently use this information in group-based models in\norder to improve individuals' predictive performance. More specifically, two\nmodel-based clustering approaches are examined, where the first is using\nmodel-extracted parameters of personalized models, whereas the second is\noptimized on the model-based forecasting performance. Both methods are then\nanalyzed using intrinsic clustering evaluation measures (e.g. Silhouette\ncoefficients) as well as the performance of a downstream forecasting scheme,\nwhere each forecasting group-model is devoted to describing all individuals\nbelonging to one cluster. Among these, clustering based on performance shows\nthe best results, in terms of all examined evaluation measures. As another\nlevel of evaluation, those group-models' performance is compared to three\nbaseline scenarios, the personalized, the all-in-one group and the random\ngroup-based concept. According to this comparison, the superiority of\nclustering-based methods is again confirmed, indicating that the utilization of\ngroup-based information could effectively enhance the overall performance of\nall individuals' data.\n","authors":["Mandani Ntekouli","Gerasimos Spanakis","Lourens Waldorp","Anne Roefs"],"pdf_url":"https://arxiv.org/pdf/2310.07491v1.pdf","comment":"17 pages, 7 figures, BNAIC/BeNeLearn 2023 (Joint International\n Scientific Conferences on AI and Machine Learning)"},{"id":"http://arxiv.org/abs/2310.07488v1","updated":"2023-10-11T13:35:05Z","published":"2023-10-11T13:35:05Z","title":"KwaiYiiMath: Technical Report","summary":" Recent advancements in large language models (LLMs) have demonstrated\nremarkable abilities in handling a variety of natural language processing (NLP)\ndownstream tasks, even on mathematical tasks requiring multi-step reasoning.
In\nthis report, we introduce KwaiYiiMath, which enhances the mathematical\nreasoning abilities of KwaiYiiBase1, by applying Supervised Fine-Tuning (SFT)\nand Reinforcement Learning from Human Feedback (RLHF) on both English\nand Chinese mathematical tasks. Meanwhile, we also constructed a small-scale\nChinese primary school mathematics test set (named KMath), consisting of 188\nexamples to evaluate the correctness of the problem-solving process generated\nby the models. Empirical studies demonstrate that KwaiYiiMath can achieve\nstate-of-the-art (SOTA) performance on GSM8k, CMath, and KMath compared with\nsimilarly sized models.\n","authors":["Jiayi Fu","Lei Lin","Xiaoyang Gao","Pengli Liu","Zhengzong Chen","Zhirui Yang","Shengnan Zhang","Xue Zheng","Yan Li","Yuliang Liu","Xucheng Ye","Yiqiao Liao","Chao Liao","Bin Chen","Chengru Song","Junchen Wan","Zijia Lin","Fuzheng Zhang","Zhongyuan Wang","Di Zhang","Kun Gai"],"pdf_url":"https://arxiv.org/pdf/2310.07488v1.pdf","comment":"technical report"},{"id":"http://arxiv.org/abs/2310.07485v1","updated":"2023-10-11T13:32:04Z","published":"2023-10-11T13:32:04Z","title":"Nonlinear embeddings for conserving Hamiltonians and other quantities\n with Neural Galerkin schemes","summary":" This work focuses on the conservation of quantities such as Hamiltonians,\nmass, and momentum when solution fields of partial differential equations are\napproximated with nonlinear parametrizations such as deep networks. The\nproposed approach builds on Neural Galerkin schemes that are based on the\nDirac--Frenkel variational principle to train nonlinear parametrizations\nsequentially in time. We first show that only adding constraints that aim to\nconserve quantities in continuous time can be insufficient because the\nnonlinear dependence on the parameters implies that even quantities that are\nlinear in the solution fields become nonlinear in the parameters and thus are\nchallenging to discretize in time. Instead, we propose Neural Galerkin schemes\nthat compute at each time step an explicit embedding onto the manifold of\nnonlinearly parametrized solution fields to guarantee conservation of\nquantities. The embeddings can be combined with standard explicit and implicit\ntime integration schemes. Numerical experiments demonstrate that the proposed\napproach conserves quantities up to machine precision.\n","authors":["Paul Schwerdtner","Philipp Schulze","Jules Berman","Benjamin Peherstorfer"],"pdf_url":"https://arxiv.org/pdf/2310.07485v1.pdf","comment":"29 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.05869v2","updated":"2023-10-11T13:25:13Z","published":"2023-10-09T17:05:25Z","title":"HyperAttention: Long-context Attention in Near-Linear Time","summary":" We present an approximate attention mechanism named HyperAttention to address\nthe computational challenges posed by the growing complexity of long contexts\nused in Large Language Models (LLMs). Recent work suggests that in the\nworst-case scenario, quadratic time is necessary unless the entries of the\nattention matrix are bounded or the matrix has low stable rank. We introduce\ntwo parameters which measure: (1) the max column norm in the normalized\nattention matrix, and (2) the ratio of row norms in the unnormalized attention\nmatrix after detecting and removing large entries. We use these fine-grained\nparameters to capture the hardness of the problem.
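As a rough illustration of the two quantities just described (not the authors' exact estimator), one could compute them from a small dense score matrix as follows. The use of column 2-norms and the threshold used to decide which entries count as "large" are assumptions of this sketch.

```python
# Rough illustration of the two hardness parameters described above for a
# dense attention matrix: (1) the max column norm of the row-normalized
# (softmax) attention matrix, and (2) the ratio of row norms of the
# unnormalized scores after masking out large entries. Column 2-norms and the
# threshold rule are choices made for this sketch only.
import numpy as np

def hardness_parameters(scores: np.ndarray, large_entry_threshold: float):
    # (1) row-wise softmax, then the largest column 2-norm
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = exp / exp.sum(axis=1, keepdims=True)
    max_col_norm = np.linalg.norm(attn, axis=0).max()

    # (2) zero out "large" entries of the unnormalized scores, then compare
    # the largest and smallest remaining row norms
    masked = np.where(np.abs(scores) > large_entry_threshold, 0.0, scores)
    row_norms = np.linalg.norm(masked, axis=1)
    row_norm_ratio = row_norms.max() / max(row_norms.min(), 1e-12)
    return max_col_norm, row_norm_ratio

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 8))          # stand-in for query-key scores
print(hardness_parameters(S, large_entry_threshold=2.0))
```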
Despite previous lower\nbounds, we are able to achieve a linear time sampling algorithm even when the\nmatrix has unbounded entries or a large stable rank, provided the above\nparameters are small. HyperAttention features a modular design that easily\naccommodates integration of other fast low-level implementations, particularly\nFlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to\nidentify large entries, HyperAttention outperforms existing methods, giving\nsignificant speed improvements compared to state-of-the-art solutions like\nFlashAttention. We validate the empirical performance of HyperAttention on a\nvariety of different long-context length datasets. For example, HyperAttention\nmakes the inference time of ChatGLM2 50\\% faster on 32k context length while\nperplexity increases from 5.6 to 6.3. On larger context length, e.g., 131k,\nwith causal masking, HyperAttention offers 5-fold speedup on a single attention\nlayer.\n","authors":["Insu Han","Rajesh Jayaram","Amin Karbasi","Vahab Mirrokni","David P. Woodruff","Amir Zandieh"],"pdf_url":"https://arxiv.org/pdf/2310.05869v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.07068v4","updated":"2023-10-11T13:09:46Z","published":"2022-07-14T17:16:45Z","title":"Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey","summary":" This paper provides a comprehensive survey of bias mitigation methods for\nachieving fairness in Machine Learning (ML) models. We collect a total of 341\npublications concerning bias mitigation for ML classifiers. These methods can\nbe distinguished based on their intervention procedure (i.e., pre-processing,\nin-processing, post-processing) and the technique they apply. We investigate\nhow existing bias mitigation methods are evaluated in the literature. In\nparticular, we consider datasets, metrics and benchmarking. Based on the\ngathered insights (e.g., What is the most popular fairness metric? How many\ndatasets are used for evaluating bias mitigation methods?), we hope to support\npractitioners in making informed choices when developing and evaluating new\nbias mitigation methods.\n","authors":["Max Hort","Zhenpeng Chen","Jie M. Zhang","Mark Harman","Federica Sarro"],"pdf_url":"https://arxiv.org/pdf/2207.07068v4.pdf","comment":"52 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.07464v1","updated":"2023-10-11T13:05:33Z","published":"2023-10-11T13:05:33Z","title":"Deep Learning Predicts Biomarker Status and Discovers Related\n Histomorphology Characteristics for Low-Grade Glioma","summary":" Biomarker detection is an indispensable part in the diagnosis and treatment\nof low-grade glioma (LGG). However, current LGG biomarker detection methods\nrely on expensive and complex molecular genetic testing, for which\nprofessionals are required to analyze the results, and intra-rater variability\nis often reported. To overcome these challenges, we propose an interpretable\ndeep learning pipeline, a Multi-Biomarker Histomorphology Discoverer\n(Multi-Beholder) model based on the multiple instance learning (MIL) framework,\nto predict the status of five biomarkers in LGG using only hematoxylin and\neosin-stained whole slide images and slide-level biomarker status labels.\nSpecifically, by incorporating the one-class classification into the MIL\nframework, accurate instance pseudo-labeling is realized for instance-level\nsupervision, which greatly complements the slide-level labels and improves the\nbiomarker prediction performance. 
Multi-Beholder demonstrates superior\nprediction performance and generalizability for five LGG biomarkers\n(AUROC=0.6469-0.9735) in two cohorts (n=607) with diverse races and scanning\nprotocols. Moreover, the excellent interpretability of Multi-Beholder allows\nfor discovering the quantitative and qualitative correlations between biomarker\nstatus and histomorphology characteristics. Our pipeline not only provides a\nnovel approach for biomarker prediction, enhancing the applicability of\nmolecular treatments for LGG patients, but also facilitates the discovery of new\nmechanisms in molecular functionality and LGG progression.\n","authors":["Zijie Fang","Yihan Liu","Yifeng Wang","Xiangyang Zhang","Yang Chen","Changjing Cai","Yiyang Lin","Ying Han","Zhi Wang","Shan Zeng","Hong Shen","Jun Tan","Yongbing Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07464v1.pdf","comment":"47 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.07463v1","updated":"2023-10-11T13:05:28Z","published":"2023-10-11T13:05:28Z","title":"Uncovering ECG Changes during Healthy Aging using Explainable AI","summary":" Cardiovascular diseases remain the leading global cause of mortality. This\nnecessitates a profound understanding of heart aging processes to diagnose\nconstraints in cardiovascular fitness. Traditionally, most of such insights\nhave been drawn from the analysis of electrocardiogram (ECG) feature changes of\nindividuals as they age. However, these features, while informative, may\npotentially obscure underlying data relationships. In this paper, we employ a\ndeep-learning model and a tree-based model to analyze ECG data from a robust\ndataset of healthy individuals across varying ages in both raw signals and ECG\nfeature format. Explainable AI techniques are then used to identify which ECG\nfeatures or raw signal characteristics are most discriminative for\ndistinguishing between age groups. Our analysis with tree-based classifiers\nreveals age-related declines in inferred breathing rates and identifies notably\nhigh SDANN values as indicative of elderly individuals, distinguishing them\nfrom younger adults. Furthermore, the deep-learning model underscores the\npivotal role of the P-wave in age predictions across all age groups, suggesting\npotential changes in the distribution of different P-wave types with age. These\nfindings shed new light on age-related ECG changes, offering insights that\ntranscend traditional feature-based approaches.\n","authors":["Gabriel Ott","Yannik Schaubelt","Juan Miguel Lopez Alcaraz","Wilhelm Haverkamp","Nils Strodthoff"],"pdf_url":"https://arxiv.org/pdf/2310.07463v1.pdf","comment":"10 pages, 8 figures, code available under\n https://github.com/AI4HealthUOL/ECG-aging"},{"id":"http://arxiv.org/abs/2310.07461v1","updated":"2023-10-11T13:05:03Z","published":"2023-10-11T13:05:03Z","title":"Efficient machine-learning surrogates for large-scale geological carbon\n and energy storage","summary":" Geological carbon and energy storage are pivotal for achieving net-zero\ncarbon emissions and addressing climate change. However, they face\nuncertainties due to geological factors and operational limitations, resulting\nin possibilities of induced seismic events or groundwater contamination. To\novercome these challenges, we propose a specialized machine-learning (ML) model\nto manage extensive reservoir models efficiently.\n While ML approaches hold promise for geological carbon storage, the\nsubstantial computational resources required for large-scale analysis are the\nobstacle.
We've developed a method to reduce the training cost for deep neural\noperator models, using domain decomposition and a topology embedder to link\nspatio-temporal points. This approach allows accurate predictions within the\nmodel's domain, even for untrained data, enhancing ML efficiency for\nlarge-scale geological storage applications.\n","authors":["Teeratorn Kadeethum","Stephen J. Verzi","Hongkyu Yoon"],"pdf_url":"https://arxiv.org/pdf/2310.07461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07446v1","updated":"2023-10-11T12:48:45Z","published":"2023-10-11T12:48:45Z","title":"ProbTS: A Unified Toolkit to Probe Deep Time-series Forecasting","summary":" Time-series forecasting serves as a linchpin in a myriad of applications,\nspanning various domains. With the growth of deep learning, this arena has\nbifurcated into two salient branches: one focuses on crafting specific neural\narchitectures tailored for time series, and the other harnesses advanced deep\ngenerative models for probabilistic forecasting. While both branches have made\nsignificant progress, their differences across data scenarios, methodological\nfocuses, and decoding schemes pose profound, yet unexplored, research\nquestions. To bridge this knowledge chasm, we introduce ProbTS, a pioneering\ntoolkit developed to synergize and compare these two distinct branches. Endowed\nwith a unified data module, a modularized model module, and a comprehensive\nevaluator module, ProbTS allows us to revisit and benchmark leading methods\nfrom both branches. The scrutiny with ProbTS highlights their distinct\ncharacteristics, relative strengths and weaknesses, and areas that need further\nexploration. Our analyses point to new avenues for research, aiming for more\neffective time-series forecasting.\n","authors":["Jiawen Zhang","Xumeng Wen","Shun Zheng","Jia Li","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2310.07446v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2310.07437v1","updated":"2023-10-11T12:40:07Z","published":"2023-10-11T12:40:07Z","title":"A Branched Deep Convolutional Network for Forecasting the Occurrence of\n Hazes in Paris using Meteorological Maps with Different Characteristic\n Spatial Scales","summary":" A deep learning platform has been developed to forecast the occurrence of the\nlow visibility events or hazes. It is trained by using multi-decadal daily\nregional maps of various meteorological and hydrological variables as input\nfeatures and surface visibility observations as the targets. To better preserve\nthe characteristic spatial information of different input features for\ntraining, two branched architectures have recently been developed for the case\nof Paris hazes. These new architectures have improved the performance of the\nnetwork, producing reasonable scores in both validation and a blind forecasting\nevaluation using the data of 2021 and 2022 that have not been used in the\ntraining and validation.\n","authors":["Chien Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07437v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07435v1","updated":"2023-10-11T12:36:42Z","published":"2023-10-11T12:36:42Z","title":"Generalized Mixture Model for Extreme Events Forecasting in Time Series\n Data","summary":" Time Series Forecasting (TSF) is a widely researched topic with broad\napplications in weather forecasting, traffic control, and stock price\nprediction. 
Extreme values in time series often significantly impact human and\nnatural systems, but predicting them is challenging due to their rare\noccurrence. Statistical methods based on Extreme Value Theory (EVT) provide a\nsystematic approach to modeling the distribution of extremes, particularly the\nGeneralized Pareto (GP) distribution for modeling the distribution of\nexceedances beyond a threshold. To overcome the subpar performance of deep\nlearning in dealing with heavy-tailed data, we propose a novel framework to\nenhance the focus on extreme events. Specifically, we propose a Deep Extreme\nMixture Model with Autoencoder (DEMMA) for time series prediction. The model\ncomprises two main modules: 1) a generalized mixture distribution based on the\nHurdle model and a reparameterized form of the GP distribution independent of the\nextreme threshold, and 2) an Autoencoder-based LSTM feature extractor and a\nquantile prediction module with a temporal attention mechanism. We demonstrate\nthe effectiveness of our approach on multiple real-world rainfall datasets.\n","authors":["Jincheng Wang","Yue Gao"],"pdf_url":"https://arxiv.org/pdf/2310.07435v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07434v1","updated":"2023-10-11T12:36:38Z","published":"2023-10-11T12:36:38Z","title":"HealthWalk: Promoting Health and Mobility through Sensor-Based Rollator\n Walker Assistance","summary":" Rollator walkers allow people with physical limitations to increase their\nmobility and give them the confidence and independence to participate in\nsociety for longer. However, rollator walker users often have poor posture,\nleading to further health problems and, in the worst case, falls. Integrating\nsensors into rollator walker designs can help to address this problem and\nresults in a platform that allows several other interesting use cases. This\npaper briefly overviews existing systems and the current research directions\nand challenges in this field. We also present our early HealthWalk rollator\nwalker prototype for data collection with older people, patients with rheumatism, multiple\nsclerosis, or Parkinson's disease, and individuals with visual impairments.\n","authors":["Ivanna Kramer","Kevin Weirauch","Sabine Bauer","Mark Oliver Mints","Peer Neubert"],"pdf_url":"https://arxiv.org/pdf/2310.07434v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07433v1","updated":"2023-10-11T12:34:39Z","published":"2023-10-11T12:34:39Z","title":"Imitation Learning from Observation with Automatic Discount Scheduling","summary":" Humans often acquire new skills through observation and imitation. For\nrobotic agents, learning from the plethora of unlabeled video demonstration\ndata available on the Internet necessitates imitating the expert without access\nto its actions, presenting a challenge known as Imitation Learning from\nObservations (ILfO). A common approach to tackle ILfO problems is to convert\nthem into inverse reinforcement learning problems, utilizing a proxy reward\ncomputed from the agent's and the expert's observations. Nonetheless, we\nidentify that tasks characterized by a progress dependency property pose\nsignificant challenges for such approaches; in these tasks, the agent needs to\ninitially learn the expert's preceding behaviors before mastering the\nsubsequent ones. Our investigation reveals that the main cause is that the\nreward signals assigned to later steps hinder the learning of initial\nbehaviors. 
To address this challenge, we present a novel ILfO framework that\nenables the agent to master earlier behaviors before advancing to later ones.\nWe introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively\nalters the discount factor in reinforcement learning during the training phase,\nprioritizing earlier rewards initially and gradually engaging later rewards\nonly when the earlier behaviors have been mastered. Our experiments, conducted\non nine Meta-World tasks, demonstrate that our method significantly outperforms\nstate-of-the-art methods across all tasks, including those that are unsolvable\nby them.\n","authors":["Yuyang Liu","Weijun Dong","Yingdong Hu","Chuan Wen","Zhao-Heng Yin","Chongjie Zhang","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2310.07433v1.pdf","comment":"Submitted to ICLR 2024"},{"id":"http://arxiv.org/abs/2310.07430v1","updated":"2023-10-11T12:32:13Z","published":"2023-10-11T12:32:13Z","title":"Non-backtracking Graph Neural Networks","summary":" The celebrated message-passing updates for graph neural networks allow the\nrepresentation of large-scale graphs with local and computationally tractable\nupdates. However, the local updates suffer from backtracking, i.e., a message\nflows through the same edge twice and revisits the previously visited node.\nSince the number of message flows increases exponentially with the number of\nupdates, the redundancy in local updates prevents the graph neural network from\naccurately recognizing a particular message flow for downstream tasks. In this\nwork, we propose to resolve such a redundancy via the non-backtracking graph\nneural network (NBA-GNN) that updates a message without incorporating the\nmessage from the previously visited node. We further investigate how NBA-GNN\nalleviates the over-squashing of GNNs, and establish a connection between\nNBA-GNN and the impressive performance of non-backtracking updates for\nstochastic block model recovery. We empirically verify the effectiveness of our\nNBA-GNN on long-range graph benchmark and transductive node classification\nproblems.\n","authors":["Seonghyun Park","Narae Ryu","Gahee Kim","Dongyeop Woo","Se-Young Yun","Sungsoo Ahn"],"pdf_url":"https://arxiv.org/pdf/2310.07430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07427v1","updated":"2023-10-11T12:28:52Z","published":"2023-10-11T12:28:52Z","title":"Quantum-Enhanced Forecasting: Leveraging Quantum Gramian Angular Field\n and CNNs for Stock Return Predictions","summary":" We propose a time series forecasting method named Quantum Gramian Angular\nField (QGAF). This approach merges the advantages of quantum computing\ntechnology with deep learning, aiming to enhance the precision of time series\nclassification and forecasting. We successfully transformed stock return time\nseries data into two-dimensional images suitable for Convolutional Neural\nNetwork (CNN) training by designing specific quantum circuits. Distinct from\nthe classical Gramian Angular Field (GAF) approach, QGAF's uniqueness lies in\neliminating the need for data normalization and inverse cosine calculations,\nsimplifying the transformation process from time series data to two-dimensional\nimages. To validate the effectiveness of this method, we conducted experiments\non datasets from three major stock markets: the China A-share market, the Hong\nKong stock market, and the US stock market. 
Experimental results revealed that\ncompared to the classical GAF method, the QGAF approach significantly improved\ntime series prediction accuracy, reducing prediction errors by an average of\n25\\% for Mean Absolute Error (MAE) and 48\\% for Mean Squared Error (MSE). This\nresearch confirms the potential and promising prospects of integrating quantum\ncomputing with deep learning techniques in financial time series forecasting.\n","authors":["Zhengmeng Xu","Hai Lin"],"pdf_url":"https://arxiv.org/pdf/2310.07427v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.13078v2","updated":"2023-10-11T12:21:30Z","published":"2023-09-21T02:46:20Z","title":"LPML: LLM-Prompting Markup Language for Mathematical Reasoning","summary":" In utilizing large language models (LLMs) for mathematical reasoning,\naddressing the errors in the reasoning and calculation present in the generated\ntext by LLMs is a crucial challenge. In this paper, we propose a novel\nframework that integrates the Chain-of-Thought (CoT) method with an external\ntool (Python REPL). We discovered that by prompting LLMs to generate structured\ntext in XML-like markup language, we could seamlessly integrate CoT and the\nexternal tool and control the undesired behaviors of LLMs. With our approach,\nLLMs can utilize Python computation to rectify errors within CoT. We applied\nour method to ChatGPT (GPT-3.5) to solve challenging mathematical problems and\ndemonstrated that combining CoT and Python REPL through the markup language\nenhances the reasoning capability of LLMs. Our approach enables LLMs to write\nthe markup language and perform advanced mathematical reasoning using only\nzero-shot prompting.\n","authors":["Ryutaro Yamauchi","Sho Sonoda","Akiyoshi Sannai","Wataru Kumagai"],"pdf_url":"https://arxiv.org/pdf/2309.13078v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.09615v4","updated":"2023-10-11T12:09:35Z","published":"2022-05-19T15:13:00Z","title":"EXACT: How to Train Your Accuracy","summary":" Classification tasks are usually evaluated in terms of accuracy. However,\naccuracy is discontinuous and cannot be directly optimized using gradient\nascent. Popular methods minimize cross-entropy, hinge loss, or other surrogate\nlosses, which can lead to suboptimal results. In this paper, we propose a new\noptimization framework by introducing stochasticity to a model's output and\noptimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive\nexperiments on linear models and deep image classification show that the\nproposed optimization method is a powerful alternative to widely used\nclassification losses.\n","authors":["Ivan Karpukhin","Stanislav Dereka","Sergey Kolesnikov"],"pdf_url":"https://arxiv.org/pdf/2205.09615v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07418v1","updated":"2023-10-11T12:05:34Z","published":"2023-10-11T12:05:34Z","title":"Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules\n and Training Stages","summary":" Plasticity, the ability of a neural network to evolve with new data, is\ncrucial for high-performance and sample-efficient visual reinforcement learning\n(VRL). Although methods like resetting and regularization can potentially\nmitigate plasticity loss, the influences of various components within the VRL\nframework on the agent's plasticity are still poorly understood. 
In this work,\nwe conduct a systematic empirical exploration focusing on three primary\nunderexplored facets and derive the following insightful conclusions: (1) data\naugmentation is essential in maintaining plasticity; (2) the critic's\nplasticity loss serves as the principal bottleneck impeding efficient training;\nand (3) without timely intervention to recover critic's plasticity in the early\nstages, its loss becomes catastrophic. These insights suggest a novel strategy\nto address the high replay ratio (RR) dilemma, where exacerbated plasticity\nloss hinders the potential improvements of sample efficiency brought by\nincreased reuse frequency. Rather than setting a static RR for the entire\ntraining process, we propose Adaptive RR, which dynamically adjusts the RR\nbased on the critic's plasticity level. Extensive evaluations indicate that\nAdaptive RR not only avoids catastrophic plasticity loss in the early stages\nbut also benefits from more frequent reuse in later phases, resulting in\nsuperior sample efficiency.\n","authors":["Guozheng Ma","Lu Li","Sen Zhang","Zixuan Liu","Zhen Wang","Yixin Chen","Li Shen","Xueqian Wang","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2310.07418v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07417v1","updated":"2023-10-11T12:03:19Z","published":"2023-10-11T12:03:19Z","title":"What can knowledge graph alignment gain with Neuro-Symbolic learning\n approaches?","summary":" Knowledge Graphs (KG) are the backbone of many data-intensive applications\nsince they can represent data coupled with its meaning and context. Aligning\nKGs across different domains and providers is necessary to afford a fuller and\nintegrated representation. A severe limitation of current KG alignment (KGA)\nalgorithms is that they fail to articulate logical thinking and reasoning with\nlexical, structural, and semantic data learning. Deep learning models are\nincreasingly popular for KGA inspired by their good performance in other tasks,\nbut they suffer from limitations in explainability, reasoning, and data\nefficiency. Hybrid neurosymbolic learning models hold the promise of\nintegrating logical and data perspectives to produce high-quality alignments\nthat are explainable and support validation through human-centric approaches.\nThis paper examines the current state of the art in KGA and explores the\npotential for neurosymbolic integration, highlighting promising research\ndirections for combining these fields.\n","authors":["Pedro Giesteira Cotovio","Ernesto Jimenez-Ruiz","Catia Pesquita"],"pdf_url":"https://arxiv.org/pdf/2310.07417v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07416v1","updated":"2023-10-11T12:01:52Z","published":"2023-10-11T12:01:52Z","title":"A Novel Voronoi-based Convolutional Neural Network Framework for Pushing\n Person Detection in Crowd Videos","summary":" Analyzing the microscopic dynamics of pushing behavior within crowds can\noffer valuable insights into crowd patterns and interactions. By identifying\ninstances of pushing in crowd videos, a deeper understanding of when, where,\nand why such behavior occurs can be achieved. This knowledge is crucial to\ncreating more effective crowd management strategies, optimizing crowd flow, and\nenhancing overall crowd experiences. However, manually identifying pushing\nbehavior at the microscopic level is challenging, and the existing automatic\napproaches cannot detect such microscopic behavior. 
Thus, this article\nintroduces a novel automatic framework for identifying pushing in videos of\ncrowds on a microscopic level. The framework comprises two main components: i)\nFeature extraction and ii) Video labeling. In the feature extraction component,\na new Voronoi-based method is developed for determining the local regions\nassociated with each person in the input video. Subsequently, these regions are\nfed into EfficientNetV1B0 Convolutional Neural Network to extract the deep\nfeatures of each person over time. In the second component, a combination of a\nfully connected layer with a Sigmoid activation function is employed to analyze\nthese deep features and annotate the individuals involved in pushing within the\nvideo. The framework is trained and evaluated on a new dataset created using\nsix real-world experiments, including their corresponding ground truths. The\nexperimental findings indicate that the suggested framework outperforms seven\nbaseline methods that are employed for comparative analysis purposes.\n","authors":["Ahmed Alia","Mohammed Maree","Mohcine Chraibi","Armin Seyfried"],"pdf_url":"https://arxiv.org/pdf/2310.07416v1.pdf","comment":"21 pages"},{"id":"http://arxiv.org/abs/2310.07402v1","updated":"2023-10-11T11:38:18Z","published":"2023-10-11T11:38:18Z","title":"NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time Series\n Pretraining","summary":" Recent research on time-series self-supervised models shows great promise in\nlearning semantic representations. However, it has been limited to small-scale\ndatasets, e.g., thousands of temporal sequences. In this work, we make key\ntechnical contributions that are tailored to the numerical properties of\ntime-series data and allow the model to scale to large datasets, e.g., millions\nof temporal sequences. We adopt the Transformer architecture by first\npartitioning the input into non-overlapping windows. Each window is then\ncharacterized by its normalized shape and two scalar values denoting the mean\nand standard deviation within each window. To embed scalar values that may\npossess arbitrary numerical scales to high-dimensional vectors, we propose a\nnumerically multi-scaled embedding module enumerating all possible scales for\nthe scalar values. The model undergoes pretraining using the proposed\nnumerically multi-scaled embedding with a simple contrastive objective on a\nlarge-scale dataset containing over a million sequences. We study its transfer\nperformance on a number of univariate and multivariate classification\nbenchmarks. Our method exhibits remarkable improvement against previous\nrepresentation learning approaches and establishes the new state of the art,\neven compared with domain-specific non-learning-based methods.\n","authors":["Chenguo Lin","Xumeng Wen","Wei Cao","Congrui Huang","Jiang Bian","Stephen Lin","Zhirong Wu"],"pdf_url":"https://arxiv.org/pdf/2310.07402v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05682v2","updated":"2023-10-11T11:28:40Z","published":"2023-10-09T12:51:46Z","title":"Analysis of Rainfall Variability and Water Extent of Selected Hydropower\n Reservoir Using Google Earth Engine (GEE): A Case Study from Two Tropical\n Countries, Sri Lanka and Vietnam","summary":" This study presents a comprehensive remote sensing analysis of rainfall\npatterns and selected hydropower reservoir water extent in two tropical monsoon\ncountries, Vietnam and Sri Lanka. 
The aim is to understand the relationship\nbetween remotely sensed rainfall data and the dynamic changes (monthly) in\nreservoir water extent. The analysis utilizes high-resolution optical imagery\nand Sentinel-1 Synthetic Aperture Radar (SAR) data to observe and monitor water\nbodies during different weather conditions, especially during the monsoon\nseason. The average annual rainfall for both countries is determined, and\nspatiotemporal variations in monthly average rainfall are examined at regional\nand reservoir basin levels using the Climate Hazards Group InfraRed\nPrecipitation with Station (CHIRPS) dataset from 1981 to 2022. Water extents\nare derived for selected reservoirs using Sentinel-1 SAR Ground Range Detected\n(GRD) images in Vietnam and Sri Lanka from 2017 to 2022. The images are\npre-processed and corrected using terrain correction and refined Lee filter. An\nautomated thresholding algorithm, OTSU, distinguishes water and land, taking\nadvantage of both VV and VH polarization data. The connected pixel count\nthreshold is applied to enhance result accuracy. The results indicate a clear\nrelationship between rainfall patterns and reservoir water extent, with\nincreased precipitation during the monsoon season leading to higher water\nextents in the later months. This study contributes to understanding how\nrainfall variability impacts reservoir water resources in tropical monsoon\nregions. The preliminary findings can inform water resource management\nstrategies and support these countries' decision-making processes related to\nhydropower generation, flood management, and irrigation.\n","authors":["Punsisi Rajakaruna","Surajit Ghosh","Bunyod Holmatov"],"pdf_url":"https://arxiv.org/pdf/2310.05682v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07392v1","updated":"2023-10-11T11:20:35Z","published":"2023-10-11T11:20:35Z","title":"Deep Kernel and Image Quality Estimators for Optimizing Robotic\n Ultrasound Controller using Bayesian Optimization","summary":" Ultrasound is a commonly used medical imaging modality that requires expert\nsonographers to manually maneuver the ultrasound probe based on the acquired\nimage. Autonomous Robotic Ultrasound (A-RUS) is an appealing alternative to\nthis manual procedure in order to reduce sonographers' workload. The key\nchallenge to A-RUS is optimizing the ultrasound image quality for the region of\ninterest across different patients. This requires knowledge of anatomy,\nrecognition of error sources and precise probe position, orientation and\npressure. Sample efficiency is important while optimizing these parameters\nassociated with the robotized probe controller. Bayesian Optimization (BO), a\nsample-efficient optimization framework, has recently been applied to optimize\nthe 2D motion of the probe. Nevertheless, further improvements are needed to\nimprove the sample efficiency for high-dimensional control of the probe. We aim\nto overcome this problem by using a neural network to learn a low-dimensional\nkernel in BO, termed as Deep Kernel (DK). The neural network of DK is trained\nusing probe and image data acquired during the procedure. The two image quality\nestimators are proposed that use a deep convolution neural network and provide\nreal-time feedback to the BO. We validated our framework using these two\nfeedback functions on three urinary bladder phantoms. 
We obtained over 50%\nincrease in sample efficiency for 6D control of the robotized probe.\nFurthermore, our results indicate that this performance enhancement in BO is\nindependent of the specific training dataset, demonstrating inter-patient\nadaptability.\n","authors":["Deepak Raina","SH Chandrashekhara","Richard Voyles","Juan Wachs","Subir Kumar Saha"],"pdf_url":"https://arxiv.org/pdf/2310.07392v1.pdf","comment":"Accepted in IEEE International Symposium on Medical Robotics (ISMR)\n 2023"},{"id":"http://arxiv.org/abs/2310.07380v1","updated":"2023-10-11T10:55:14Z","published":"2023-10-11T10:55:14Z","title":"Histopathological Image Classification and Vulnerability Analysis using\n Federated Learning","summary":" Healthcare is one of the foremost applications of machine learning (ML).\nTraditionally, ML models are trained by central servers, which aggregate data\nfrom various distributed devices to forecast the results for newly generated\ndata. This is a major concern as models can access sensitive user information,\nwhich raises privacy concerns. A federated learning (FL) approach can help\naddress this issue: A global model sends its copy to all clients who train\nthese copies, and the clients send the updates (weights) back to it. Over time,\nthe global model improves and becomes more accurate. Data privacy is protected\nduring training, as it is conducted locally on the clients' devices.\n However, the global model is susceptible to data poisoning. We develop a\nprivacy-preserving FL technique for a skin cancer dataset and show that the\nmodel is prone to data poisoning attacks. Ten clients train the model, but one\nof them intentionally introduces flipped labels as an attack. This reduces the\naccuracy of the global model. As the percentage of label flipping increases,\nthere is a noticeable decrease in accuracy. We use a stochastic gradient\ndescent optimization algorithm to find the most optimal accuracy for the model.\nAlthough FL can protect user privacy for healthcare diagnostics, it is also\nvulnerable to data poisoning, which must be addressed.\n","authors":["Sankalp Vyas","Amar Nath Patra","Raj Mani Shukla"],"pdf_url":"https://arxiv.org/pdf/2310.07380v1.pdf","comment":"Accepted in IEEE International Conference on Trust, Security and\n Privacy in Computing and Communications (TrustCom)"},{"id":"http://arxiv.org/abs/2310.07379v1","updated":"2023-10-11T10:54:44Z","published":"2023-10-11T10:54:44Z","title":"Causal Unsupervised Semantic Segmentation","summary":" Unsupervised semantic segmentation aims to achieve high-quality semantic\ngrouping without human-labeled annotations. With the advent of self-supervised\npre-training, various frameworks utilize the pre-trained features to train\nprediction heads for unsupervised dense prediction. However, a significant\nchallenge in this unsupervised setup is determining the appropriate level of\nclustering required for segmenting concepts. To address it, we propose a novel\nframework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages\ninsights from causal inference. Specifically, we bridge intervention-oriented\napproach (i.e., frontdoor adjustment) to define suitable two-step tasks for\nunsupervised prediction. The first step involves constructing a concept\nclusterbook as a mediator, which represents possible concept prototypes at\ndifferent levels of granularity in a discretized form. Then, the mediator\nestablishes an explicit link to the subsequent concept-wise self-supervised\nlearning for pixel-level grouping. 
Through extensive experiments and analyses\non various datasets, we corroborate the effectiveness of CAUSE and achieve\nstate-of-the-art performance in unsupervised semantic segmentation.\n","authors":["Junho Kim","Byung-Kwan Lee","Yong Man Ro"],"pdf_url":"https://arxiv.org/pdf/2310.07379v1.pdf","comment":"code available:\n https://github.com/ByungKwanLee/Causal-Unsupervised-Segmentation"},{"id":"http://arxiv.org/abs/2305.08732v3","updated":"2023-10-11T10:51:12Z","published":"2023-05-15T15:47:09Z","title":"Knowledge Rumination for Pre-trained Language Models","summary":" Previous studies have revealed that vanilla pre-trained language models\n(PLMs) lack the capacity to handle knowledge-intensive NLP tasks alone; thus,\nseveral works have attempted to integrate external knowledge into PLMs.\nHowever, despite the promising outcome, we empirically observe that PLMs may\nhave already encoded rich knowledge in their pre-trained parameters but fail to\nfully utilize it when applied to knowledge-intensive tasks. In this\npaper, we propose a new paradigm dubbed Knowledge Rumination to help the\npre-trained language model utilize the related latent knowledge without\nretrieving it from the external corpus. By simply adding a prompt like \"As far\nas I know\" to the PLMs, we try to review related latent knowledge and inject\nit back into the model for knowledge consolidation. We apply the proposed\nknowledge rumination to various language models, including RoBERTa, DeBERTa,\nand GPT-3. Experimental results on six commonsense reasoning tasks and GLUE\nbenchmarks demonstrate the effectiveness of our proposed approach, which proves\nthat the knowledge stored in PLMs can be better exploited to enhance\nperformance. Code is available at\nhttps://github.com/zjunlp/knowledge-rumination.\n","authors":["Yunzhi Yao","Peng Wang","Shengyu Mao","Chuanqi Tan","Fei Huang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.08732v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07371v1","updated":"2023-10-11T10:41:51Z","published":"2023-10-11T10:41:51Z","title":"Experimental quantum natural gradient optimization in photonics","summary":" Variational quantum algorithms (VQAs), combining the advantages of\nparameterized quantum circuits and classical optimizers, promise practical\nquantum applications in the Noisy Intermediate-Scale Quantum era. The\nperformance of VQAs heavily depends on the optimization method. Compared with\ngradient-free and ordinary gradient descent methods, the quantum natural\ngradient (QNG), which mirrors the geometric structure of the parameter space,\ncan achieve faster convergence and avoid local minima more easily, thereby\nreducing the cost of circuit executions. We utilized a fully programmable\nphotonic chip to experimentally estimate the QNG in photonics for the first\ntime. We obtained the dissociation curve of the He-H$^+$ cation and achieved\nchemical accuracy, verifying the superior performance of QNG optimization on a\nphotonic device. 
Our work opens up a vista of utilizing QNG in photonics to\nimplement practical near-term quantum applications.\n","authors":["Yizhi Wang","Shichuan Xue","Yaxuan Wang","Jiangfang Ding","Weixu Shi","Dongyang Wang","Yong Liu","Yingwen Liu","Xiang Fu","Guangyao Huang","Anqi Huang","Mingtang Deng","Junjie Wu"],"pdf_url":"https://arxiv.org/pdf/2310.07371v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07370v1","updated":"2023-10-11T10:40:43Z","published":"2023-10-11T10:40:43Z","title":"Orthogonal Random Features: Explicit Forms and Sharp Inequalities","summary":" Random features have been introduced to scale up kernel methods via\nrandomization techniques. In particular, random Fourier features and orthogonal\nrandom features were used to approximate the popular Gaussian kernel. The\nformer is performed by a random Gaussian matrix and leads exactly to the\nGaussian kernel after averaging. In this work, we analyze the bias and the\nvariance of the kernel approximation based on orthogonal random features which\nmakes use of Haar orthogonal matrices. We provide explicit expressions for\nthese quantities using normalized Bessel functions and derive sharp exponential\nbounds supporting the view that orthogonal random features are more informative\nthan random Fourier features.\n","authors":["Nizar Demni","Hachem Kadri"],"pdf_url":"https://arxiv.org/pdf/2310.07370v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07367v1","updated":"2023-10-11T10:34:52Z","published":"2023-10-11T10:34:52Z","title":"Improved Analysis of Sparse Linear Regression in Local Differential\n Privacy Model","summary":" In this paper, we revisit the problem of sparse linear regression in the\nlocal differential privacy (LDP) model. Existing research in the\nnon-interactive and sequentially local models has focused on obtaining the\nlower bounds for the case where the underlying parameter is $1$-sparse, and\nextending such bounds to the more general $k$-sparse case has proven to be\nchallenging. Moreover, it is unclear whether efficient non-interactive LDP\n(NLDP) algorithms exist. To address these issues, we first consider the problem\nin the $\\epsilon$ non-interactive LDP model and provide a lower bound of\n$\\Omega(\\frac{\\sqrt{dk\\log d}}{\\sqrt{n}\\epsilon})$ on the $\\ell_2$-norm\nestimation error for sub-Gaussian data, where $n$ is the sample size and $d$ is\nthe dimension of the space. We propose an innovative NLDP algorithm, the very\nfirst of its kind for the problem. As a remarkable outcome, this algorithm also\nyields a novel and highly efficient estimator as a valuable by-product. Our\nalgorithm achieves an upper bound of\n$\\tilde{O}({\\frac{d\\sqrt{k}}{\\sqrt{n}\\epsilon}})$ for the estimation error when\nthe data is sub-Gaussian, which can be further improved by a factor of\n$O(\\sqrt{d})$ if the server has additional public but unlabeled data. For the\nsequentially interactive LDP model, we show a similar lower bound of\n$\\Omega({\\frac{\\sqrt{dk}}{\\sqrt{n}\\epsilon}})$. As for the upper bound, we\nrectify a previous method and show that it is possible to achieve a bound of\n$\\tilde{O}(\\frac{k\\sqrt{d}}{\\sqrt{n}\\epsilon})$. 
Our findings reveal\nfundamental differences between the non-private case, central DP model, and\nlocal DP model in the sparse linear regression problem.\n","authors":["Liyang Zhu","Meng Ding","Vaneet Aggarwal","Jinhui Xu","Di Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07367v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07365v1","updated":"2023-10-11T10:30:49Z","published":"2023-10-11T10:30:49Z","title":"GraphControl: Adding Conditional Control to Universal Graph Pre-trained\n Models for Graph Domain Transfer Learning","summary":" Graph-structured data is ubiquitous in the world which models complex\nrelationships between objects, enabling various Web applications. Daily\ninfluxes of unlabeled graph data on the Web offer immense potential for these\napplications. Graph self-supervised algorithms have achieved significant\nsuccess in acquiring generic knowledge from abundant unlabeled graph data.\nThese pre-trained models can be applied to various downstream Web applications,\nsaving training time and improving downstream (target) performance. However,\ndifferent graphs, even across seemingly similar domains, can differ\nsignificantly in terms of attribute semantics, posing difficulties, if not\ninfeasibility, for transferring the pre-trained models to downstream tasks.\nConcretely speaking, for example, the additional task-specific node information\nin downstream tasks (specificity) is usually deliberately omitted so that the\npre-trained representation (transferability) can be leveraged. The trade-off as\nsuch is termed as \"transferability-specificity dilemma\" in this work. To\naddress this challenge, we introduce an innovative deployment module coined as\nGraphControl, motivated by ControlNet, to realize better graph domain transfer\nlearning. Specifically, by leveraging universal structural pre-trained models\nand GraphControl, we align the input space across various graphs and\nincorporate unique characteristics of target data as conditional inputs. These\nconditions will be progressively integrated into the model during fine-tuning\nor prompt tuning through ControlNet, facilitating personalized deployment.\nExtensive experiments show that our method significantly enhances the\nadaptability of pre-trained models on target attributed datasets, achieving\n1.4-3x performance gain. Furthermore, it outperforms training-from-scratch\nmethods on target data with a comparable margin and exhibits faster\nconvergence.\n","authors":["Yun Zhu","Yaoke Wang","Haizhou Shi","Zhenshuo Zhang","Siliang Tang"],"pdf_url":"https://arxiv.org/pdf/2310.07365v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2307.02484v4","updated":"2023-10-11T10:27:40Z","published":"2023-07-05T17:58:21Z","title":"Elastic Decision Transformer","summary":" This paper introduces Elastic Decision Transformer (EDT), a significant\nadvancement over the existing Decision Transformer (DT) and its variants.\nAlthough DT purports to generate an optimal trajectory, empirical evidence\nsuggests it struggles with trajectory stitching, a process involving the\ngeneration of an optimal or near-optimal trajectory from the best parts of a\nset of sub-optimal trajectories. The proposed EDT differentiates itself by\nfacilitating trajectory stitching during action inference at test time,\nachieved by adjusting the history length maintained in DT. 
Further, the EDT\noptimizes the trajectory by retaining a longer history when the previous\ntrajectory is optimal and a shorter one when it is sub-optimal, enabling it to\n\"stitch\" with a more optimal trajectory. Extensive experimentation demonstrates\nEDT's ability to bridge the performance gap between DT-based and Q\nLearning-based approaches. In particular, the EDT outperforms Q Learning-based\nmethods in a multi-task regime on the D4RL locomotion benchmark and Atari\ngames. Videos are available at: https://kristery.github.io/edt/\n","authors":["Yueh-Hua Wu","Xiaolong Wang","Masashi Hamaya"],"pdf_url":"https://arxiv.org/pdf/2307.02484v4.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.11509v2","updated":"2023-10-11T10:20:58Z","published":"2023-02-22T17:26:03Z","title":"Construction of Knowledge Graphs: State and Challenges","summary":" With knowledge graphs (KGs) at the center of numerous applications such as\nrecommender systems and question answering, the need for generalized pipelines\nto construct and continuously update such KGs is increasing. While the\nindividual steps that are necessary to create KGs from unstructured (e.g. text)\nand structured data sources (e.g. databases) are mostly well-researched for\ntheir one-shot execution, their adoption for incremental KG updates and the\ninterplay of the individual steps have hardly been investigated in a systematic\nmanner so far. In this work, we first discuss the main graph models for KGs and\nintroduce the major requirement for future KG construction pipelines. Next, we\nprovide an overview of the necessary steps to build high-quality KGs, including\ncross-cutting topics such as metadata management, ontology development, and\nquality assurance. We then evaluate the state of the art of KG construction\nw.r.t the introduced requirements for specific popular KGs as well as some\nrecent tools and strategies for KG construction. Finally, we identify areas in\nneed of further research and improvement.\n","authors":["Marvin Hofer","Daniel Obraczka","Alieh Saeedi","Hanna Köpcke","Erhard Rahm"],"pdf_url":"https://arxiv.org/pdf/2302.11509v2.pdf","comment":"51 pages, 5 figures, 4 tables, 328 references"},{"id":"http://arxiv.org/abs/2310.07359v1","updated":"2023-10-11T10:17:41Z","published":"2023-10-11T10:17:41Z","title":"Diagnosing Bipolar Disorder from 3-D Structural Magnetic Resonance\n Images Using a Hybrid GAN-CNN Method","summary":" Bipolar Disorder (BD) is a psychiatric condition diagnosed by repetitive\ncycles of hypomania and depression. Since diagnosing BD relies on subjective\nbehavioral assessments over a long period, a solid diagnosis based on objective\ncriteria is not straightforward. The current study responded to the described\nobstacle by proposing a hybrid GAN-CNN model to diagnose BD from 3-D structural\nMRI Images (sMRI). The novelty of this study stems from diagnosing BD from sMRI\nsamples rather than conventional datasets such as functional MRI (fMRI),\nelectroencephalography (EEG), and behavioral symptoms while removing the data\ninsufficiency usually encountered when dealing with sMRI samples. The impact of\nvarious augmentation ratios is also tested using 5-fold cross-validation. Based\non the results, this study obtains an accuracy rate of 75.8%, a sensitivity of\n60.3%, and a specificity of 82.5%, which are 3-5% higher than prior work while\nutilizing less than 6% sample counts. 
Next, it is demonstrated that a 2-D\nlayer-based GAN generator can effectively reproduce complex 3D brain samples, a\nmore straightforward technique than manual image processing. Lastly, the\noptimum augmentation threshold for the current study using 172 sMRI samples is\n50%, showing the applicability of the described method for larger sMRI\ndatasets. In conclusion, it is established that data augmentation using GAN\nimproves the accuracy of the CNN classifier using sMRI samples, thus enabling\nmore reliable decision support systems to assist practitioners in identifying\nBD patients more reliably and in a shorter period.\n","authors":["Masood Hamed Saghayan","Mohammad Hossein Zolfagharnasab","Ali Khadem","Farzam Matinfar","Hassan Rashidi"],"pdf_url":"https://arxiv.org/pdf/2310.07359v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07355v1","updated":"2023-10-11T10:12:43Z","published":"2023-10-11T10:12:43Z","title":"IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training","summary":" In the field of medical Vision-Language Pre-training (VLP), significant\nefforts have been devoted to deriving text and image features from both\nclinical reports and associated medical images. However, most existing methods\nmay have overlooked the opportunity of leveraging the inherent hierarchical\nstructure of clinical reports, which are generally split into `findings' for\ndescriptive content and `impressions' for conclusive observation. Instead of\nutilizing this rich, structured format, current medical VLP approaches often\nsimplify the report into either a unified entity or fragmented tokens. In this\nwork, we propose a novel clinical prior guided VLP framework named IMITATE to\nlearn the structure information from medical reports with hierarchical\nvision-language alignment. The framework derives multi-level visual features\nfrom the chest X-ray (CXR) images and separately aligns these features with the\ndescriptive and the conclusive text encoded in the hierarchical medical report.\nFurthermore, a new clinical-informed contrastive loss is introduced for\ncross-modal learning, which accounts for clinical prior knowledge in\nformulating sample correlations in contrastive learning. The proposed model,\nIMITATE, outperforms baseline VLP methods across six different datasets,\nspanning five medical imaging downstream tasks. Comprehensive experimental\nresults highlight the advantages of integrating the hierarchical structure of\nmedical reports for vision-language alignment.\n","authors":["Che Liu","Sibo Cheng","Miaojing Shi","Anand Shah","Wenjia Bai","Rossella Arcucci"],"pdf_url":"https://arxiv.org/pdf/2310.07355v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2310.07351v1","updated":"2023-10-11T10:03:10Z","published":"2023-10-11T10:03:10Z","title":"Atom-Motif Contrastive Transformer for Molecular Property Prediction","summary":" Recently, Graph Transformer (GT) models have been widely used in the task of\nMolecular Property Prediction (MPP) due to their high reliability in\ncharacterizing the latent relationship among graph nodes (i.e., the atoms in a\nmolecule). However, most existing GT-based methods usually explore the basic\ninteractions between pairwise atoms, and thus they fail to consider the\nimportant interactions among critical motifs (e.g., functional groups consisting\nof several atoms) of molecules. 
As motifs in a molecule are significant\npatterns that are of great importance for determining molecular properties\n(e.g., toxicity and solubility), overlooking motif interactions inevitably\nhinders the effectiveness of MPP. To address this issue, we propose a novel\nAtom-Motif Contrastive Transformer (AMCT), which not only explores the\natom-level interactions but also considers the motif-level interactions. Since\nthe representations of atoms and motifs for a given molecule are actually two\ndifferent views of the same instance, they are naturally aligned to generate\nthe self-supervisory signals for model training. Meanwhile, the same motif can\nexist in different molecules, and hence we also employ the contrastive loss to\nmaximize the representation agreement of identical motifs across different\nmolecules. Finally, in order to clearly identify the motifs that are critical\nin deciding the properties of each molecule, we further construct a\nproperty-aware attention mechanism into our learning framework. Our proposed\nAMCT is extensively evaluated on seven popular benchmark datasets, and both\nquantitative and qualitative results firmly demonstrate its effectiveness when\ncompared with the state-of-the-art methods.\n","authors":["Wentao Yu","Shuo Chen","Chen Gong","Gang Niu","Masashi Sugiyama"],"pdf_url":"https://arxiv.org/pdf/2310.07351v1.pdf","comment":"submit to AAAI-24"},{"id":"http://arxiv.org/abs/2310.07347v1","updated":"2023-10-11T09:55:46Z","published":"2023-10-11T09:55:46Z","title":"Fast-ELECTRA for Efficient Pre-training","summary":" ELECTRA pre-trains language models by detecting tokens in a sequence that\nhave been replaced by an auxiliary model. Although ELECTRA offers a significant\nboost in efficiency, its potential is constrained by the training cost brought\nby the auxiliary model. Notably, this model, which is jointly trained with the\nmain model, only serves to assist the training of the main model and is\ndiscarded post-training. This results in a substantial amount of training cost\nbeing expended in vain. To mitigate this issue, we propose Fast-ELECTRA, which\nleverages an existing language model as the auxiliary model. To construct a\nlearning curriculum for the main model, we smooth its output distribution via\ntemperature scaling following a descending schedule. Our approach rivals the\nperformance of state-of-the-art ELECTRA-style pre-training methods, while\nsignificantly eliminating the computation and memory cost brought by the joint\ntraining of the auxiliary model. Our method also reduces the sensitivity to\nhyper-parameters and enhances the pre-training stability.\n","authors":["Chengyu Dong","Liyuan Liu","Hao Cheng","Jingbo Shang","Jianfeng Gao","Xiaodong Liu"],"pdf_url":"https://arxiv.org/pdf/2310.07347v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06841v2","updated":"2023-10-11T09:53:51Z","published":"2023-04-13T22:20:54Z","title":"Video alignment using unsupervised learning of local and global features","summary":" In this paper, we tackle the problem of video alignment, the process of\nmatching the frames of a pair of videos containing similar actions. The main\nchallenge in video alignment is that accurate correspondence should be\nestablished despite the differences in the execution processes and appearances\nbetween the two videos. We introduce an unsupervised method for alignment that\nuses global and local features of the frames. 
In particular, we introduce\neffective features for each video frame using three machine vision tools:\nperson detection, pose estimation, and VGG network. Then, the features are\nprocessed and combined to construct a multidimensional time series that\nrepresents the video. The resulting time series are used to align videos of the\nsame actions using a novel version of dynamic time warping named Diagonalized\nDynamic Time Warping(DDTW). The main advantage of our approach is that no\ntraining is required, which makes it applicable for any new type of action\nwithout any need to collect training samples for it. For evaluation, we\nconsidered video synchronization and phase classification tasks on the Penn\naction dataset. Also, for an effective evaluation of the video synchronization\ntask, we present a new metric called Enclosed Area Error(EAE). The results show\nthat our method outperforms previous state-of-the-art methods, such as TCC, and\nother self-supervised and weakly supervised methods.\n","authors":["Niloufar Fakhfour","Mohammad ShahverdiKondori","Hoda Mohammadzade"],"pdf_url":"https://arxiv.org/pdf/2304.06841v2.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.07338v1","updated":"2023-10-11T09:37:38Z","published":"2023-10-11T09:37:38Z","title":"Towards Foundation Models for Learning on Tabular Data","summary":" Learning on tabular data underpins numerous real-world applications. Despite\nconsiderable efforts in developing effective learning models for tabular data,\ncurrent transferable tabular models remain in their infancy, limited by either\nthe lack of support for direct instruction following in new tasks or the\nneglect of acquiring foundational knowledge and capabilities from diverse\ntabular datasets. In this paper, we propose Tabular Foundation Models (TabFMs)\nto overcome these limitations. TabFMs harness the potential of generative\ntabular learning, employing a pre-trained large language model (LLM) as the\nbase model and fine-tuning it using purpose-designed objectives on an extensive\nrange of tabular datasets. This approach endows TabFMs with a profound\nunderstanding and universal capabilities essential for learning on tabular\ndata. Our evaluations underscore TabFM's effectiveness: not only does it\nsignificantly excel in instruction-following tasks like zero-shot and\nin-context inference, but it also showcases performance that approaches, and in\ninstances, even transcends, the renowned yet mysterious closed-source LLMs like\nGPT-4. Furthermore, when fine-tuning with scarce data, our model achieves\nremarkable efficiency and maintains competitive performance with abundant\ntraining data. Finally, while our results are promising, we also delve into\nTabFM's limitations and potential opportunities, aiming to stimulate and\nexpedite future research on developing more potent TabFMs.\n","authors":["Han Zhang","Xumeng Wen","Shun Zheng","Wei Xu","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2310.07338v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.09809v3","updated":"2023-10-11T09:34:21Z","published":"2022-10-18T12:28:37Z","title":"Analysis of Convolutions, Non-linearity and Depth in Graph Neural\n Networks using Neural Tangent Kernel","summary":" The fundamental principle of Graph Neural Networks (GNNs) is to exploit the\nstructural information of the data by aggregating the neighboring nodes using a\n`graph convolution' in conjunction with a suitable choice for the network\narchitecture, such as depth and activation functions. 
Therefore, understanding\nthe influence of each design choice on the network performance is\ncrucial. Convolutions based on the graph Laplacian have emerged as the dominant\nchoice, with the symmetric normalization of the adjacency matrix as the most\nwidely adopted one. However, some empirical studies show that row normalization\nof the adjacency matrix outperforms it in node classification. Despite the\nwidespread use of GNNs, there is no rigorous theoretical study on the\nrepresentation power of these convolutions that could explain this behavior.\nSimilarly, the empirical observation that linear GNNs perform on\npar with non-linear ReLU GNNs lacks rigorous theory.\n In this work, we theoretically analyze the influence of different aspects of\nthe GNN architecture using the Graph Neural Tangent Kernel in a semi-supervised\nnode classification setting. Under the population Degree Corrected Stochastic\nBlock Model, we prove that: (i) linear networks capture the class information\nas well as ReLU networks; (ii) row normalization preserves the underlying class\nstructure better than other convolutions; (iii) performance degrades with\nnetwork depth due to over-smoothing, but the loss in class information is the\nslowest in row normalization; (iv) skip connections retain the class\ninformation even at infinite depth, thereby eliminating over-smoothing. We\nfinally validate our theoretical findings numerically and on real datasets such\nas Cora and Citeseer.\n","authors":["Mahalakshmi Sabanayagam","Pascal Esser","Debarghya Ghoshdastidar"],"pdf_url":"https://arxiv.org/pdf/2210.09809v3.pdf","comment":"41 pages, 24 figures. Code available at\n https://github.com/mahalakshmi-sabanayagam/NTK_GCN"},{"id":"http://arxiv.org/abs/2310.07335v1","updated":"2023-10-11T09:25:24Z","published":"2023-10-11T09:25:24Z","title":"Exploring Social Motion Latent Space and Human Awareness for Effective\n Robot Navigation in Crowded Environments","summary":" This work proposes a novel approach to social robot navigation by learning to\ngenerate robot controls from a social motion latent space. By leveraging this\nsocial motion latent space, the proposed method achieves significant\nimprovements in social navigation metrics such as success rate, navigation\ntime, and trajectory length while producing smoother (less jerk and angular\ndeviations) and more anticipatory trajectories. The superiority of the proposed\nmethod is demonstrated through comparison with baseline models in various\nscenarios. Additionally, the concept of humans' awareness towards the robot is\nintroduced into the social robot navigation framework, showing that\nincorporating human awareness leads to shorter and smoother trajectories owing\nto humans' ability to positively interact with the robot.\n","authors":["Junaid Ahmed Ansari","Satyajit Tourani","Gourav Kumar","Brojeshwar Bhowmick"],"pdf_url":"https://arxiv.org/pdf/2310.07335v1.pdf","comment":"Accepted at IROS 2023"},{"id":"http://arxiv.org/abs/2310.07325v1","updated":"2023-10-11T09:14:40Z","published":"2023-10-11T09:14:40Z","title":"An Adversarial Example for Direct Logit Attribution: Memory Management\n in gelu-4l","summary":" We provide concrete evidence for memory management in a 4-layer transformer.\nSpecifically, we identify clean-up behavior, in which model components\nconsistently remove the output of preceding components during a forward pass.\nOur findings suggest that the interpretability technique Direct Logit\nAttribution provides misleading results. 
We show explicit examples where this\ntechnique is inaccurate, as it does not account for clean-up behavior.\n","authors":["James Dao","Yeu-Tong Lao","Can Rager","Jett Janiak"],"pdf_url":"https://arxiv.org/pdf/2310.07325v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07323v1","updated":"2023-10-11T09:14:17Z","published":"2023-10-11T09:14:17Z","title":"Multichannel consecutive data cross-extraction with 1DCNN-attention for\n diagnosis of power transformer","summary":" The power transformer plays a critical role in grid infrastructure, and its\ndiagnosis is paramount for maintaining stable operation. However, the current\nmethods for transformer diagnosis focus on discrete dissolved gas analysis,\nneglecting deep feature extraction of multichannel consecutive data. The\nunutilized sequential data contains significant temporal information\nreflecting the transformer's condition. In light of this, the structure of\nmultichannel consecutive data cross-extraction (MCDC) is proposed in this\narticle in order to comprehensively exploit the intrinsic characteristics and\nevaluate the state of the transformer. Moreover, to better accommodate the\nscenario of transformer diagnosis, a one-dimensional convolutional neural network\nattention (1DCNN-attention) mechanism is introduced and offers a more efficient\nsolution given the simplified spatial complexity. Finally, the effectiveness of\nMCDC and the superior generalization ability, compared with other algorithms,\nare validated in experiments conducted on a dataset collected from real\noperation cases of power transformers. Additionally, the better stability of\n1DCNN-attention has also been verified.\n","authors":["Wei Zheng","Guogang Zhang","Chenchen Zhao","Qianqian Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.07323v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07321v1","updated":"2023-10-11T09:09:55Z","published":"2023-10-11T09:09:55Z","title":"On the Impact of Cross-Domain Data on German Language Models","summary":" Traditionally, large language models have been trained on either general web\ncrawls or domain-specific data. However, recent successes of generative large\nlanguage models have shed light on the benefits of cross-domain datasets. To\nexamine the significance of prioritizing data diversity over quality, we\npresent a German dataset comprising texts from five domains, along with another\ndataset aimed at containing high-quality data. Through training a series of\nmodels ranging between 122M and 750M parameters on both datasets, we conduct a\ncomprehensive benchmark on multiple downstream tasks. Our findings demonstrate\nthat the models trained on the cross-domain dataset outperform those trained on\nquality data alone, leading to improvements of up to $4.45\\%$ over the previous\nstate-of-the-art. The models are available at\nhttps://huggingface.co/ikim-uk-essen\n","authors":["Amin Dada","Aokun Chen","Cheng Peng","Kaleb E Smith","Ahmad Idrissi-Yaghir","Constantin Marc Seibold","Jianning Li","Lars Heiliger","Christoph M. 
Friedrich","Daniel Truhn","Jan Egger","Jiang Bian","Jens Kleesiek","Yonghui Wu"],"pdf_url":"https://arxiv.org/pdf/2310.07321v1.pdf","comment":"13 pages, 1 figure, accepted at Findings of the Association for\n Computational Linguistics: EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07320v1","updated":"2023-10-11T09:09:50Z","published":"2023-10-11T09:09:50Z","title":"Byzantine-Resilient Decentralized Multi-Armed Bandits","summary":" In decentralized cooperative multi-armed bandits (MAB), each agent observes a\ndistinct stream of rewards, and seeks to exchange information with others to\nselect a sequence of arms so as to minimize its regret. Agents in the\ncooperative setting can outperform a single agent running a MAB method such as\nUpper-Confidence Bound (UCB) independently. In this work, we study how to\nrecover such salient behavior when an unknown fraction of the agents can be\nByzantine, that is, communicate arbitrarily wrong information in the form of\nreward mean-estimates or confidence sets. This framework can be used to model\nattackers in computer networks, instigators of offensive content into\nrecommender systems, or manipulators of financial markets. Our key contribution\nis the development of a fully decentralized resilient upper confidence bound\n(UCB) algorithm that fuses an information mixing step among agents with a\ntruncation of inconsistent and extreme values. This truncation step enables us\nto establish that the performance of each normal agent is no worse than the\nclassic single-agent UCB1 algorithm in terms of regret, and more importantly,\nthe cumulative regret of all normal agents is strictly better than the\nnon-cooperative case, provided that each agent has at least 3f+1 neighbors\nwhere f is the maximum possible Byzantine agents in each agent's neighborhood.\nExtensions to time-varying neighbor graphs, and minimax lower bounds are\nfurther established on the achievable regret. Experiments corroborate the\nmerits of this framework in practice.\n","authors":["Jingxuan Zhu","Alec Koppel","Alvaro Velasquez","Ji Liu"],"pdf_url":"https://arxiv.org/pdf/2310.07320v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.07941v4","updated":"2023-10-11T09:07:24Z","published":"2022-03-15T14:25:44Z","title":"Reachability In Simple Neural Networks","summary":" We investigate the complexity of the reachability problem for (deep) neural\nnetworks: does it compute valid output given some valid input? It was recently\nclaimed that the problem is NP-complete for general neural networks and\nspecifications over the input/output dimension given by conjunctions of linear\ninequalities. We recapitulate the proof and repair some flaws in the original\nupper and lower bound proofs. Motivated by the general result, we show that\nNP-hardness already holds for restricted classes of simple specifications and\nneural networks. Allowing for a single hidden layer and an output dimension of\none as well as neural networks with just one negative, zero and one positive\nweight or bias is sufficient to ensure NP-hardness. 
Additionally, we give a\nthorough discussion and outlook of possible extensions for this direction of\nresearch on neural network verification.\n","authors":["Marco Sälzer","Martin Lange"],"pdf_url":"https://arxiv.org/pdf/2203.07941v4.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2108.13179"},{"id":"http://arxiv.org/abs/2104.05859v5","updated":"2023-10-11T09:07:01Z","published":"2021-04-12T23:14:41Z","title":"Rapid Exploration for Open-World Navigation with Latent Goal Models","summary":" We describe a robotic learning system for autonomous exploration and\nnavigation in diverse, open-world environments. At the core of our method is a\nlearned latent variable model of distances and actions, along with a\nnon-parametric topological memory of images. We use an information bottleneck\nto regularize the learned policy, giving us (i) a compact visual representation\nof goals, (ii) improved generalization capabilities, and (iii) a mechanism for\nsampling feasible goals for exploration. Trained on a large offline dataset of\nprior experience, the model acquires a representation of visual goals that is\nrobust to task-irrelevant distractors. We demonstrate our method on a mobile\nground robot in open-world exploration scenarios. Given an image of a goal that\nis up to 80 meters away, our method leverages its representation to explore and\ndiscover the goal in under 20 minutes, even amidst previously-unseen obstacles\nand weather conditions. Please check out the project website for videos of our\nexperiments and information about the real-world dataset used at\nhttps://sites.google.com/view/recon-robot.\n","authors":["Dhruv Shah","Benjamin Eysenbach","Gregory Kahn","Nicholas Rhinehart","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2104.05859v5.pdf","comment":"Presented at 5th Annual Conference on Robot Learning (CoRL 2021),\n London, UK as an Oral Talk. Project page and dataset release at\n https://sites.google.com/view/recon-robot"},{"id":"http://arxiv.org/abs/2310.05052v2","updated":"2023-10-11T09:04:01Z","published":"2023-10-08T07:25:27Z","title":"Learning Intra- and Inter-Cell Differences for Accurate Battery Lifespan\n Prediction across Diverse Conditions","summary":" Battery life prediction holds significant practical value for battery\nresearch and development. Currently, many data-driven models rely on early\nelectrical signals from specific target batteries to predict their lifespan. A\ncommon shortfall is that most existing methods are developed based on specific\naging conditions, which not only limits their model's capability but also\ndiminishes their effectiveness in predicting degradation under varied\nconditions. As a result, these models often miss out on fully benefiting from\nthe rich historical data available under other conditions. Here, to address\nabove, we introduce an approach that explicitly captures differences between\nelectrical signals of a target battery and a reference battery, irrespective of\ntheir materials and aging conditions, to forecast the target battery life.\nThrough this inter-cell difference, we not only enhance the feature space but\nalso pave the way for a universal battery life prediction framework.\nRemarkably, our model that combines the inter- and intra-cell differences\nshines across diverse conditions, standing out in its efficiency and accuracy\nusing all accessible datasets. 
An essential application of our approach is its\ncapability to leverage data from older batteries effectively, enabling newer\nbatteries to capitalize on insights gained from past batteries. This work not\nonly enriches the battery data utilization strategy but also sets the stage for\nsmarter battery management system in the future.\n","authors":["Han Zhang","Yuqi Li","Shun Zheng","Ziheng Lu","Xiaofan Gui","Wei Xu","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2310.05052v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07313v1","updated":"2023-10-11T09:00:02Z","published":"2023-10-11T09:00:02Z","title":"Molecule-Edit Templates for Efficient and Accurate Retrosynthesis\n Prediction","summary":" Retrosynthesis involves determining a sequence of reactions to synthesize\ncomplex molecules from simpler precursors. As this poses a challenge in organic\nchemistry, machine learning has offered solutions, particularly for predicting\npossible reaction substrates for a given target molecule. These solutions\nmainly fall into template-based and template-free categories. The former is\nefficient but relies on a vast set of predefined reaction patterns, while the\nlatter, though more flexible, can be computationally intensive and less\ninterpretable. To address these issues, we introduce METRO (Molecule-Edit\nTemplates for RetrOsynthesis), a machine-learning model that predicts reactions\nusing minimal templates - simplified reaction patterns capturing only essential\nmolecular changes - reducing computational overhead and achieving\nstate-of-the-art results on standard benchmarks.\n","authors":["Mikołaj Sacha","Michał Sadowski","Piotr Kozakowski","Ruard van Workum","Stanisław Jastrzębski"],"pdf_url":"https://arxiv.org/pdf/2310.07313v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07312v1","updated":"2023-10-11T08:57:59Z","published":"2023-10-11T08:57:59Z","title":"WiGenAI: The Symphony of Wireless and Generative AI via Diffusion Models","summary":" Innovative foundation models, such as GPT-3 and stable diffusion models, have\nmade a paradigm shift in the realm of artificial intelligence (AI) towards\ngenerative AI-based systems. In unison, from data communication and networking\nperspective, AI and machine learning (AI/ML) algorithms are envisioned to be\npervasively incorporated into the future generations of wireless communications\nsystems, highlighting the need for novel AI-native solutions for the emergent\ncommunication scenarios. In this article, we outline the applications of\ngenerative AI in wireless communication systems to lay the foundations for\nresearch in this field. Diffusion-based generative models, as the new\nstate-of-the-art paradigm of generative models, are introduced, and their\napplications in wireless communication systems are discussed. Two case studies\nare also presented to showcase how diffusion models can be exploited for the\ndevelopment of resilient AI-native communication systems. Specifically, we\npropose denoising diffusion probabilistic models (DDPM) for a wireless\ncommunication scheme with non-ideal transceivers, where 30% improvement is\nachieved in terms of bit error rate. As the second application, DDPMs are\nemployed at the transmitter to shape the constellation symbols, highlighting a\nrobust out-of-distribution performance. 
Finally, future directions and open\nissues for the development of generative AI-based wireless systems are\ndiscussed to promote future research endeavors towards wireless generative AI\n(WiGenAI).\n","authors":["Mehdi Letafati","Samad Ali","Matti Latva-aho"],"pdf_url":"https://arxiv.org/pdf/2310.07312v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.01276v2","updated":"2023-10-11T08:57:34Z","published":"2023-06-02T05:34:01Z","title":"Enhancing Sample Efficiency in Black-box Combinatorial Optimization via\n Symmetric Replay Training","summary":" Black-box combinatorial optimization (black-box CO) is frequently encountered\nin various industrial fields, such as drug discovery or hardware design.\nDespite its widespread relevance, solving black-box CO problems is highly\nchallenging due to the vast combinatorial solution space and resource-intensive\nnature of black-box function evaluations. These inherent complexities induce\nsignificant constraints on the efficacy of existing deep reinforcement learning\n(DRL) methods when applied to practical problem settings. For efficient\nexploration with the limited availability of function evaluations, this paper\nintroduces a new generic method to enhance sample efficiency. We propose\nsymmetric replay training that leverages the high-reward samples and their\nunder-explored regions in the symmetric space. In replay training, the policy\nis trained to imitate the symmetric trajectories of these high-rewarded\nsamples. The proposed method is beneficial for the exploration of highly\nrewarded regions without the necessity for additional online interactions -\nfree. The experimental results show that our method consistently improves the\nsample efficiency of various DRL methods on real-world tasks, including\nmolecular optimization and hardware design.\n","authors":["Hyeonah Kim","Minsu Kim","Sungsoo Ahn","Jinkyoo Park"],"pdf_url":"https://arxiv.org/pdf/2306.01276v2.pdf","comment":"18 pages (including 6 pages of the appendix)"},{"id":"http://arxiv.org/abs/2310.07306v1","updated":"2023-10-11T08:40:06Z","published":"2023-10-11T08:40:06Z","title":"SNOiC: Soft Labeling and Noisy Mixup based Open Intent Classification\n Model","summary":" This paper presents a Soft Labeling and Noisy Mixup-based open intent\nclassification model (SNOiC). Most of the previous works have used\nthreshold-based methods to identify open intents, which are prone to\noverfitting and may produce biased predictions. Additionally, the need for more\navailable data for an open intent class presents another limitation for these\nexisting models. SNOiC combines Soft Labeling and Noisy Mixup strategies to\nreduce the biasing and generate pseudo-data for open intent class. The\nexperimental results on four benchmark datasets show that the SNOiC model\nachieves a minimum and maximum performance of 68.72\\% and 94.71\\%,\nrespectively, in identifying open intents. Moreover, compared to\nstate-of-the-art models, the SNOiC model improves the performance of\nidentifying open intents by 0.93\\% (minimum) and 12.76\\% (maximum). The model's\nefficacy is further established by analyzing various parameters used in the\nproposed model. 
An ablation study is also conducted, which involves creating\nthree model variants to validate the effectiveness of the SNOiC model.\n","authors":["Aditi Kanwar","Aditi Seetha","Satyendra Singh Chouhan","Rajdeep Niyogi"],"pdf_url":"https://arxiv.org/pdf/2310.07306v1.pdf","comment":"9 Pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.00113v2","updated":"2023-10-11T08:38:28Z","published":"2023-09-29T20:01:11Z","title":"HyperMask: Adaptive Hypernetwork-based Masks for Continual Learning","summary":" Artificial neural networks suffer from catastrophic forgetting when they are\nsequentially trained on multiple tasks. To overcome this problem, there exist\nmany continual learning strategies. One of the most effective is the\nhypernetwork-based approach. The hypernetwork generates the weights of a target\nmodel based on the task's identity. The model's main limitation is that\nhypernetwork can produce completely different nests for each task.\nConsequently, each task is solved separately. The model does not use\ninformation from the network dedicated to previous tasks and practically\nproduces new architectures when it learns the subsequent tasks. To solve such a\nproblem, we use the lottery ticket hypothesis, which postulates the existence\nof sparse subnetworks, named winning tickets, that preserve the performance of\na full network. In the paper, we propose a method called HyperMask, which\ntrains a single network for all tasks. Hypernetwork produces semi-binary masks\nto obtain target subnetworks dedicated to new tasks. This solution inherits the\nability of the hypernetwork to adapt to new tasks with minimal forgetting.\nMoreover, due to the lottery ticket hypothesis, we can use a single network\nwith weighted subnets dedicated to each task.\n","authors":["Kamil Książek","Przemysław Spurek"],"pdf_url":"https://arxiv.org/pdf/2310.00113v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05672v2","updated":"2023-10-11T08:37:40Z","published":"2023-10-09T12:42:39Z","title":"Multi-timestep models for Model-based Reinforcement Learning","summary":" In model-based reinforcement learning (MBRL), most algorithms rely on\nsimulating trajectories from one-step dynamics models learned on data. A\ncritical challenge of this approach is the compounding of one-step prediction\nerrors as length of the trajectory grows. In this paper we tackle this issue by\nusing a multi-timestep objective to train one-step models. Our objective is a\nweighted sum of a loss function (e.g., negative log-likelihood) at various\nfuture horizons. We explore and test a range of weights profiles. We find that\nexponentially decaying weights lead to models that significantly improve the\nlong-horizon R2 score. This improvement is particularly noticeable when the\nmodels were evaluated on noisy data. Finally, using a soft actor-critic (SAC)\nagent in pure batch reinforcement learning (RL) and iterated batch RL\nscenarios, we found that our multi-timestep models outperform or match standard\none-step models. 
This was especially evident in a noisy variant of the\nconsidered environment, highlighting the potential of our approach in\nreal-world applications.\n","authors":["Abdelhakim Benechehab","Giuseppe Paolo","Albert Thomas","Maurizio Filippone","Balázs Kégl"],"pdf_url":"https://arxiv.org/pdf/2310.05672v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07298v1","updated":"2023-10-11T08:32:46Z","published":"2023-10-11T08:32:46Z","title":"Beyond Memorization: Violating Privacy Via Inference with Large Language\n Models","summary":" Current privacy research on large language models (LLMs) primarily focuses on\nthe issue of extracting memorized training data. At the same time, models'\ninference capabilities have increased drastically. This raises the key question\nof whether current LLMs could violate individuals' privacy by inferring\npersonal attributes from text given at inference time. In this work, we present\nthe first comprehensive study on the capabilities of pretrained LLMs to infer\npersonal attributes from text. We construct a dataset consisting of real Reddit\nprofiles, and show that current LLMs can infer a wide range of personal\nattributes (e.g., location, income, sex), achieving up to $85\\%$ top-1 and\n$95.8\\%$ top-3 accuracy at a fraction of the cost ($100\\times$) and time\n($240\\times$) required by humans. As people increasingly interact with\nLLM-powered chatbots across all aspects of life, we also explore the emerging\nthreat of privacy-invasive chatbots trying to extract personal information\nthrough seemingly benign questions. Finally, we show that common mitigations,\ni.e., text anonymization and model alignment, are currently ineffective at\nprotecting user privacy against LLM inference. Our findings highlight that\ncurrent LLMs can infer personal data at a previously unattainable scale. In the\nabsence of working defenses, we advocate for a broader discussion around LLM\nprivacy implications beyond memorization, striving for a wider privacy\nprotection.\n","authors":["Robin Staab","Mark Vero","Mislav Balunović","Martin Vechev"],"pdf_url":"https://arxiv.org/pdf/2310.07298v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07297v1","updated":"2023-10-11T08:31:26Z","published":"2023-10-11T08:31:26Z","title":"Score Regularized Policy Optimization through Diffusion Behavior","summary":" Recent developments in offline reinforcement learning have uncovered the\nimmense potential of diffusion modeling, which excels at representing\nheterogeneous behavior policies. However, sampling from diffusion policies is\nconsiderably slow because it necessitates tens to hundreds of iterative\ninference steps for one action. To address this issue, we propose to extract an\nefficient deterministic inference policy from critic models and pretrained\ndiffusion behavior models, leveraging the latter to directly regularize the\npolicy gradient with the behavior distribution's score function during\noptimization. 
Our method enjoys powerful generative capabilities of diffusion\nmodeling while completely circumventing the computationally intensive and\ntime-consuming diffusion sampling scheme, both during training and evaluation.\nExtensive results on D4RL tasks show that our method boosts action sampling\nspeed by more than 25 times compared with various leading diffusion-based\nmethods in locomotion tasks, while still maintaining state-of-the-art\nperformance.\n","authors":["Huayu Chen","Cheng Lu","Zhengyi Wang","Hang Su","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.07297v1.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2310.07276v1","updated":"2023-10-11T07:57:08Z","published":"2023-10-11T07:57:08Z","title":"BioT5: Enriching Cross-modal Integration in Biology with Chemical\n Knowledge and Natural Language Associations","summary":" Recent advancements in biological research leverage the integration of\nmolecules, proteins, and natural language to enhance drug discovery. However,\ncurrent models exhibit several limitations, such as the generation of invalid\nmolecular SMILES, underutilization of contextual information, and equal\ntreatment of structured and unstructured knowledge. To address these issues, we\npropose $\\mathbf{BioT5}$, a comprehensive pre-training framework that enriches\ncross-modal integration in biology with chemical knowledge and natural language\nassociations. $\\mathbf{BioT5}$ utilizes SELFIES for $100%$ robust molecular\nrepresentations and extracts knowledge from the surrounding context of\nbio-entities in unstructured biological literature. Furthermore,\n$\\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge,\nleading to more effective utilization of information. After fine-tuning, BioT5\nshows superior performance across a wide range of tasks, demonstrating its\nstrong capability of capturing underlying relations and properties of\nbio-entities. Our code is available at\n$\\href{https://github.com/QizhiPei/BioT5}{Github}$.\n","authors":["Qizhi Pei","Wei Zhang","Jinhua Zhu","Kehan Wu","Kaiyuan Gao","Lijun Wu","Yingce Xia","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.07276v1.pdf","comment":"Empirical Methods in Natural Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2310.07269v1","updated":"2023-10-11T07:51:10Z","published":"2023-10-11T07:51:10Z","title":"Why Does Sharpness-Aware Minimization Generalize Better Than SGD?","summary":" The challenge of overfitting, in which the model memorizes the training data\nand fails to generalize to test data, has become increasingly significant in\nthe training of large neural networks. To tackle this challenge,\nSharpness-Aware Minimization (SAM) has emerged as a promising training method,\nwhich can improve the generalization of neural networks even in the presence of\nlabel noise. However, a deep understanding of how SAM works, especially in the\nsetting of nonlinear neural networks and classification tasks, remains largely\nmissing. This paper fills this gap by demonstrating why SAM generalizes better\nthan Stochastic Gradient Descent (SGD) for a certain data model and two-layer\nconvolutional ReLU networks. The loss landscape of our studied problem is\nnonsmooth, thus current explanations for the success of SAM based on the\nHessian information are insufficient. Our result explains the benefits of SAM,\nparticularly its ability to prevent noise learning in the early stages, thereby\nfacilitating more effective learning of features. 
Experiments on both synthetic\nand real data corroborate our theory.\n","authors":["Zixiang Chen","Junkai Zhang","Yiwen Kou","Xiangning Chen","Cho-Jui Hsieh","Quanquan Gu"],"pdf_url":"https://arxiv.org/pdf/2310.07269v1.pdf","comment":"52 pages, 4 figures, 2 tables. In NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.07268v1","updated":"2023-10-11T07:50:51Z","published":"2023-10-11T07:50:51Z","title":"RaftFed: A Lightweight Federated Learning Framework for Vehicular Crowd\n Intelligence","summary":" Vehicular crowd intelligence (VCI) is an emerging research field. Facilitated\nby state-of-the-art vehicular ad-hoc networks and artificial intelligence,\nvarious VCI applications come to place, e.g., collaborative sensing,\npositioning, and mapping. The collaborative property of VCI applications\ngenerally requires data to be shared among participants, thus forming\nnetwork-wide intelligence. How to fulfill this process without compromising\ndata privacy remains a challenging issue. Although federated learning (FL) is a\npromising tool to solve the problem, adapting conventional FL frameworks to VCI\nis nontrivial. First, the centralized model aggregation is unreliable in VCI\nbecause of the existence of stragglers with unfavorable channel conditions.\nSecond, existing FL schemes are vulnerable to Non-IID data, which is\nintensified by the data heterogeneity in VCI. This paper proposes a novel\nfederated learning framework called RaftFed to facilitate privacy-preserving\nVCI. The experimental results show that RaftFed performs better than baselines\nregarding communication overhead, model accuracy, and model convergence.\n","authors":["Changan Yang","Yaxing Chen","Yao Zhang","Helei Cui","Zhiwen Yu","Bin Guo","Zheng Yan","Zijiang Yang"],"pdf_url":"https://arxiv.org/pdf/2310.07268v1.pdf","comment":"8 pages,8 figures"},{"id":"http://arxiv.org/abs/2304.03864v2","updated":"2023-10-11T07:50:28Z","published":"2023-04-07T23:25:48Z","title":"SGDP: A Stream-Graph Neural Network Based Data Prefetcher","summary":" Data prefetching is important for storage system optimization and access\nperformance improvement. Traditional prefetchers work well for mining access\npatterns of sequential logical block address (LBA) but cannot handle complex\nnon-sequential patterns that commonly exist in real-world applications. The\nstate-of-the-art (SOTA) learning-based prefetchers cover more LBA accesses.\nHowever, they do not adequately consider the spatial interdependencies between\nLBA deltas, which leads to limited performance and robustness. This paper\nproposes a novel Stream-Graph neural network-based Data Prefetcher (SGDP).\nSpecifically, SGDP models LBA delta streams using a weighted directed graph\nstructure to represent interactive relations among LBA deltas and further\nextracts hybrid features by graph neural networks for data prefetching. We\nconduct extensive experiments on eight real-world datasets. Empirical results\nverify that SGDP outperforms the SOTA methods in terms of the hit ratio by\n6.21%, the effective prefetching ratio by 7.00%, and speeds up inference time\nby 3.13X on average. Besides, we generalize SGDP to different variants by\ndifferent stream constructions, further expanding its application scenarios and\ndemonstrating its robustness. SGDP offers a novel data prefetching solution and\nhas been verified in commercial hybrid storage systems in the experimental\nphase. 
Our codes and appendix are available at\nhttps://github.com/yyysjz1997/SGDP/.\n","authors":["Yiyuan Yang","Rongshang Li","Qiquan Shi","Xijun Li","Gang Hu","Xing Li","Mingxuan Yuan"],"pdf_url":"https://arxiv.org/pdf/2304.03864v2.pdf","comment":"Accepted by International Joint Conference on Neural Networks (IJCNN\n 2023)"},{"id":"http://arxiv.org/abs/2306.10347v2","updated":"2023-10-11T07:50:09Z","published":"2023-06-17T13:40:15Z","title":"DCdetector: Dual Attention Contrastive Representation Learning for Time\n Series Anomaly Detection","summary":" Time series anomaly detection is critical for a wide range of applications.\nIt aims to identify deviant samples from the normal sample distribution in time\nseries. The most fundamental challenge for this task is to learn a\nrepresentation map that enables effective discrimination of anomalies.\nReconstruction-based methods still dominate, but the representation learning\nwith anomalies might hurt the performance with its large abnormal loss. On the\nother hand, contrastive learning aims to find a representation that can clearly\ndistinguish any instance from the others, which can bring a more natural and\npromising representation for time series anomaly detection. In this paper, we\npropose DCdetector, a multi-scale dual attention contrastive representation\nlearning model. DCdetector utilizes a novel dual attention asymmetric design to\ncreate the permutated environment and pure contrastive loss to guide the\nlearning process, thus learning a permutation invariant representation with\nsuperior discrimination abilities. Extensive experiments show that DCdetector\nachieves state-of-the-art results on multiple time series anomaly detection\nbenchmark datasets. Code is publicly available at\nhttps://github.com/DAMO-DI-ML/KDD2023-DCdetector.\n","authors":["Yiyuan Yang","Chaoli Zhang","Tian Zhou","Qingsong Wen","Liang Sun"],"pdf_url":"https://arxiv.org/pdf/2306.10347v2.pdf","comment":"Accepted by ACM SIGKDD International Conference on Knowledge\n Discovery & Data Mining (KDD 2023)"},{"id":"http://arxiv.org/abs/2305.00472v2","updated":"2023-10-11T07:47:07Z","published":"2023-04-30T13:10:56Z","title":"Efficient MILP Decomposition in Quantum Computing for ReLU Network\n Robustness","summary":" Emerging quantum computing technologies, such as Noisy Intermediate-Scale\nQuantum (NISQ) devices, offer potential advancements in solving mathematical\noptimization problems. However, limitations in qubit availability, noise, and\nerrors pose challenges for practical implementation. In this study, we examine\ntwo decomposition methods for Mixed-Integer Linear Programming (MILP) designed\nto reduce the original problem size and utilize available NISQ devices more\nefficiently. We concentrate on breaking down the original problem into smaller\nsubproblems, which are then solved iteratively using a combined\nquantum-classical hardware approach. We conduct a detailed analysis for the\ndecomposition of MILP with Benders and Dantzig-Wolfe methods. In our analysis,\nwe show that the number of qubits required to solve Benders is exponentially\nlarge in the worst-case, while remains constant for Dantzig-Wolfe.\nAdditionally, we leverage Dantzig-Wolfe decomposition on the use-case of\ncertifying the robustness of ReLU networks. 
Our experimental results\ndemonstrate that this approach can save up to 90\\% of qubits compared to\nexisting methods on quantum annealing and gate-based quantum computers.\n","authors":["Nicola Franco","Tom Wollschläger","Benedikt Poggel","Stephan Günnemann","Jeanette Miriam Lorenz"],"pdf_url":"https://arxiv.org/pdf/2305.00472v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07264v1","updated":"2023-10-11T07:40:46Z","published":"2023-10-11T07:40:46Z","title":"Classification of Dysarthria based on the Levels of Severity. A\n Systematic Review","summary":" Dysarthria is a neurological speech disorder that can significantly impact\naffected individuals' communication abilities and overall quality of life. The\naccurate and objective classification of dysarthria and the determination of\nits severity are crucial for effective therapeutic intervention. While\ntraditional assessments by speech-language pathologists (SLPs) are common, they\nare often subjective, time-consuming, and can vary between practitioners.\nEmerging machine learning-based models have shown the potential to provide a\nmore objective dysarthria assessment, enhancing diagnostic accuracy and\nreliability. This systematic review aims to comprehensively analyze current\nmethodologies for classifying dysarthria based on severity levels.\nSpecifically, this review will focus on determining the most effective set and\ntype of features that can be used for automatic patient classification and\nevaluating the best AI techniques for this purpose. We will systematically\nreview the literature on the automatic classification of dysarthria severity\nlevels. Sources of information will include electronic databases and grey\nliterature. Selection criteria will be established based on relevance to the\nresearch questions. Data extraction will include methodologies used, the type\nof features extracted for classification, and AI techniques employed. The\nfindings of this systematic review will contribute to the current understanding\nof dysarthria classification, inform future research, and support the\ndevelopment of improved diagnostic tools. The implications of these findings\ncould be significant in advancing patient care and improving therapeutic\noutcomes for individuals affected by dysarthria.\n","authors":["Afnan Al-Ali","Somaya Al-Maadeed","Moutaz Saleh","Rani Chinnappa Naidu","Zachariah C Alex","Prakash Ramachandran","Rajeev Khoodeeram","Rajesh Kumar M"],"pdf_url":"https://arxiv.org/pdf/2310.07264v1.pdf","comment":"no comments"},{"id":"http://arxiv.org/abs/2310.07261v1","updated":"2023-10-11T07:38:37Z","published":"2023-10-11T07:38:37Z","title":"Deep ReLU networks and high-order finite element methods II: Chebyshev\n emulation","summary":" Expression rates and stability in Sobolev norms of deep ReLU neural networks\n(NNs) in terms of the number of parameters defining the NN for continuous,\npiecewise polynomial functions, on arbitrary, finite partitions $\\mathcal{T}$\nof a bounded interval $(a,b)$ are addressed. Novel constructions of ReLU NN\nsurrogates encoding the approximated functions in terms of Chebyshev polynomial\nexpansion coefficients are developed. Chebyshev coefficients can be computed\neasily from the values of the function in the Clenshaw--Curtis points using the\ninverse fast Fourier transform. Bounds on expression rates and stability that\nare superior to those of constructions based on ReLU NN emulations of monomials\nconsidered in [Opschoor, Petersen, Schwab, 2020] are obtained. 
All emulation\nbounds are explicit in terms of the (arbitrary) partition of the interval, the\ntarget emulation accuracy and the polynomial degree in each element of the\npartition. ReLU NN emulation error estimates are provided for various classes\nof functions and norms, commonly encountered in numerical analysis. In\nparticular, we show exponential ReLU emulation rate bounds for analytic\nfunctions with point singularities and develop an interface between Chebfun\napproximations and constructive ReLU NN emulations.\n","authors":["Joost A. A. Opschoor","Christoph Schwab"],"pdf_url":"https://arxiv.org/pdf/2310.07261v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.08786v6","updated":"2023-10-11T07:37:01Z","published":"2022-10-17T07:01:17Z","title":"Exposing Influence Campaigns in the Age of LLMs: A Behavioral-Based AI\n Approach to Detecting State-Sponsored Trolls","summary":" The detection of state-sponsored trolls operating in influence campaigns on\nsocial media is a critical and unsolved challenge for the research community,\nwhich has significant implications beyond the online realm. To address this\nchallenge, we propose a new AI-based solution that identifies troll accounts\nsolely through behavioral cues associated with their sequences of sharing\nactivity, encompassing both their actions and the feedback they receive from\nothers. Our approach does not incorporate any textual content shared and\nconsists of two steps: First, we leverage an LSTM-based classifier to determine\nwhether account sequences belong to a state-sponsored troll or an organic,\nlegitimate user. Second, we employ the classified sequences to calculate a\nmetric named the \"Troll Score\", quantifying the degree to which an account\nexhibits troll-like behavior. To assess the effectiveness of our method, we\nexamine its performance in the context of the 2016 Russian interference\ncampaign during the U.S. Presidential election. Our experiments yield\ncompelling results, demonstrating that our approach can identify account\nsequences with an AUC close to 99% and accurately differentiate between Russian\ntrolls and organic users with an AUC of 91%. Notably, our behavioral-based\napproach holds a significant advantage in the ever-evolving landscape, where\ntextual and linguistic properties can be easily mimicked by Large Language\nModels (LLMs): In contrast to existing language-based techniques, it relies on\nmore challenging-to-replicate behavioral cues, ensuring greater resilience in\nidentifying influence campaigns, especially given the potential increase in the\nusage of LLMs for generating inauthentic content. Finally, we assessed the\ngeneralizability of our solution to various entities driving different\ninformation operations and found promising results that will guide future\nresearch.\n","authors":["Fatima Ezzeddine","Luca Luceri","Omran Ayoub","Ihab Sbeity","Gianluca Nogara","Emilio Ferrara","Silvia Giordano"],"pdf_url":"https://arxiv.org/pdf/2210.08786v6.pdf","comment":"22"},{"id":"http://arxiv.org/abs/2306.01984v2","updated":"2023-10-11T07:35:27Z","published":"2023-06-03T02:46:31Z","title":"DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal\n Forecasting","summary":" While diffusion models can successfully generate data and make predictions,\nthey are predominantly designed for static images. 
We propose an approach for\nefficiently training diffusion models for probabilistic spatiotemporal\nforecasting, where generating stable and accurate rollout forecasts remains\nchallenging, Our method, DYffusion, leverages the temporal dynamics in the\ndata, directly coupling it with the diffusion steps in the model. We train a\nstochastic, time-conditioned interpolator and a forecaster network that mimic\nthe forward and reverse processes of standard diffusion models, respectively.\nDYffusion naturally facilitates multi-step and long-range forecasting, allowing\nfor highly flexible, continuous-time sampling trajectories and the ability to\ntrade-off performance with accelerated sampling at inference time. In addition,\nthe dynamics-informed diffusion process in DYffusion imposes a strong inductive\nbias and significantly improves computational efficiency compared to\ntraditional Gaussian noise-based diffusion models. Our approach performs\ncompetitively on probabilistic forecasting of complex dynamics in sea surface\ntemperatures, Navier-Stokes flows, and spring mesh systems.\n","authors":["Salva Rühling Cachay","Bo Zhao","Hailey Joren","Rose Yu"],"pdf_url":"https://arxiv.org/pdf/2306.01984v2.pdf","comment":"Accepted to NeurIPS 2023; Code is available at:\n https://github.com/Rose-STL-Lab/dyffusion"},{"id":"http://arxiv.org/abs/2310.07253v1","updated":"2023-10-11T07:30:18Z","published":"2023-10-11T07:30:18Z","title":"ADMEOOD: Out-of-Distribution Benchmark for Drug Property Prediction","summary":" Obtaining accurate and valid information for drug molecules is a crucial and\nchallenging task. However, chemical knowledge and information have been\naccumulated over the past 100 years from various regions, laboratories, and\nexperimental purposes. Little has been explored in terms of the\nout-of-distribution (OOD) problem with noise and inconsistency, which may lead\nto weak robustness and unsatisfied performance. This study proposes a novel\nbenchmark ADMEOOD, a systematic OOD dataset curator and benchmark specifically\ndesigned for drug property prediction. ADMEOOD obtained 27 ADME (Absorption,\nDistribution, Metabolism, Excretion) drug properties from Chembl and relevant\nliterature. Additionally, it includes two kinds of OOD data shifts: Noise Shift\nand Concept Conflict Drift (CCD). Noise Shift responds to the noise level by\ncategorizing the environment into different confidence levels. On the other\nhand, CCD describes the data which has inconsistent label among the original\ndata. Finally, it tested on a variety of domain generalization models, and the\nexperimental results demonstrate the effectiveness of the proposed partition\nmethod in ADMEOOD: ADMEOOD demonstrates a significant difference performance\nbetween in-distribution and out-of-distribution data. Moreover, ERM (Empirical\nRisk Minimization) and other models exhibit distinct trends in performance\nacross different domains and measurement types.\n","authors":["Shuoying Wei","Xinlong Wen","Lida Zhu","Songquan Li","Rongbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.07253v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07252v1","updated":"2023-10-11T07:30:01Z","published":"2023-10-11T07:30:01Z","title":"A Comparative Study of Pre-trained CNNs and GRU-Based Attention for\n Image Caption Generation","summary":" Image captioning is a challenging task involving generating a textual\ndescription for an image using computer vision and natural language processing\ntechniques. 
This paper proposes a deep neural framework for image caption\ngeneration using a GRU-based attention mechanism. Our approach employs multiple\npre-trained convolutional neural networks as the encoder to extract features\nfrom the image and a GRU-based language model as the decoder to generate\ndescriptive sentences. To improve performance, we integrate the Bahdanau\nattention model with the GRU decoder to enable learning to focus on specific\nimage parts. We evaluate our approach using the MSCOCO and Flickr30k datasets\nand show that it achieves competitive scores compared to state-of-the-art\nmethods. Our proposed framework can bridge the gap between computer vision and\nnatural language and can be extended to specific domains.\n","authors":["Rashid Khan","Bingding Huang","Haseeb Hassan","Asim Zaman","Zhongfu Ye"],"pdf_url":"https://arxiv.org/pdf/2310.07252v1.pdf","comment":"15pages, 10 figures, 5 tables. 2023 the 5th International Conference\n on Robotics and Computer Vision (ICRCV 2023). arXiv admin note: substantial\n text overlap with arXiv:2203.01594"},{"id":"http://arxiv.org/abs/2310.07250v1","updated":"2023-10-11T07:27:28Z","published":"2023-10-11T07:27:28Z","title":"Synthesizing Missing MRI Sequences from Available Modalities using\n Generative Adversarial Networks in BraTS Dataset","summary":" Glioblastoma is a highly aggressive and lethal form of brain cancer. Magnetic\nresonance imaging (MRI) plays a significant role in the diagnosis, treatment\nplanning, and follow-up of glioblastoma patients due to its non-invasive and\nradiation-free nature. The International Brain Tumor Segmentation (BraTS)\nchallenge has contributed to generating numerous AI algorithms to accurately\nand efficiently segment glioblastoma sub-compartments using four structural\n(T1, T1Gd, T2, T2-FLAIR) MRI scans. However, these four MRI sequences may not\nalways be available. To address this issue, Generative Adversarial Networks\n(GANs) can be used to synthesize the missing MRI sequences. In this paper, we\nimplement and utilize an open-source GAN approach that takes any three MRI\nsequences as input to generate the missing fourth structural sequence. Our\nproposed approach is contributed to the community-driven generally nuanced deep\nlearning framework (GaNDLF) and demonstrates promising results in synthesizing\nhigh-quality and realistic MRI sequences, enabling clinicians to improve their\ndiagnostic capabilities and support the application of AI methods to brain\ntumor MRI quantification.\n","authors":["Ibrahim Ethem Hamamci"],"pdf_url":"https://arxiv.org/pdf/2310.07250v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07245v1","updated":"2023-10-11T07:22:37Z","published":"2023-10-11T07:22:37Z","title":"Crowd Counting in Harsh Weather using Image Denoising with Pix2Pix GANs","summary":" Visual crowd counting estimates the density of the crowd using deep learning\nmodels such as convolution neural networks (CNNs). The performance of the model\nheavily relies on the quality of the training data that constitutes crowd\nimages. In harsh weather such as fog, dust, and low light conditions, the\ninference performance may severely degrade on the noisy and blur images. 
In\nthis paper, we propose the use of Pix2Pix generative adversarial network (GAN)\nto first denoise the crowd images prior to passing them to the counting model.\nA Pix2Pix network is trained using synthetic noisy images generated from\noriginal crowd images and then the pretrained generator is then used in the\ninference engine to estimate the crowd density in unseen, noisy crowd images.\nThe performance is tested on JHU-Crowd dataset to validate the significance of\nthe proposed method particularly when high reliability and accuracy are\nrequired.\n","authors":["Muhammad Asif Khan","Hamid Menouar","Ridha Hamila"],"pdf_url":"https://arxiv.org/pdf/2310.07245v1.pdf","comment":"The paper has been accepted for presentation in IEEE 38th\n International Conference on Image and Vision Computing New Zealand (IVCNZ\n 2023). The final manuscript can be accessed at ieeexplore"},{"id":"http://arxiv.org/abs/2310.07241v1","updated":"2023-10-11T07:13:16Z","published":"2023-10-11T07:13:16Z","title":"Surrogate modeling for stochastic crack growth processes in structural\n health monitoring applications","summary":" Fatigue crack growth is one of the most common types of deterioration in\nmetal structures with significant implications on their reliability. Recent\nadvances in Structural Health Monitoring (SHM) have motivated the use of\nstructural response data to predict future crack growth under uncertainty, in\norder to enable a transition towards predictive maintenance. Accurately\nrepresenting different sources of uncertainty in stochastic crack growth (SCG)\nprocesses is a non-trivial task. The present work builds on previous research\non physics-based SCG modeling under both material and load-related uncertainty.\nThe aim here is to construct computationally efficient, probabilistic surrogate\nmodels for SCG processes that successfully encode these different sources of\nuncertainty. An approach inspired by latent variable modeling is employed that\nutilizes Gaussian Process (GP) regression models to enable the surrogates to be\nused to generate prior distributions for different Bayesian SHM tasks as the\napplication of interest. Implementation is carried out in a numerical setting\nand model performance is assessed for two fundamental crack SHM problems;\nnamely crack length monitoring (damage quantification) and crack growth\nmonitoring (damage prognosis).\n","authors":["Nicholas E. Silionis","Konstantinos N. Anyfantis"],"pdf_url":"https://arxiv.org/pdf/2310.07241v1.pdf","comment":"20 pages, 9 figures. Preprint submitted to Elsevier journal"},{"id":"http://arxiv.org/abs/2310.07240v1","updated":"2023-10-11T07:08:20Z","published":"2023-10-11T07:08:20Z","title":"CacheGen: Fast Context Loading for Language Model Applications","summary":" As large language models (LLMs) take on more complex tasks, their inputs\nincorporate longer contexts to respond to questions that require domain\nknowledge or user-specific conversational histories. Yet, using long contexts\nposes a challenge for responsive LLM systems, as nothing can be generated until\nall the contexts are fetched to and processed by the LLM. Existing systems\noptimize only the computation delay in context processing (e.g., by caching\nintermediate key-value features of the text context) but often cause longer\nnetwork delays in context fetching (e.g., key-value features consume orders of\nmagnitude larger bandwidth than the text context).\n This paper presents CacheGen to minimize the delays in fetching and\nprocessing contexts for LLMs. 
CacheGen reduces the bandwidth needed for\ntransmitting long contexts' key-value (KV) features through a novel encoder\nthat compresses KV features into more compact bitstream representations. The\nencoder combines adaptive quantization with a tailored arithmetic coder, taking\nadvantage of the KV features' distributional properties, such as locality\nacross tokens. Furthermore, CacheGen minimizes the total delay in fetching and\nprocessing a context by using a controller that determines when to load the\ncontext as compressed KV features or raw text and picks the appropriate\ncompression level if loaded as KV features. We test CacheGen on three models of\nvarious sizes and three datasets of different context lengths. Compared to\nrecent methods that handle long contexts, CacheGen reduces bandwidth usage by\n3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3x\nwhile maintaining similar LLM performance on various tasks as loading the text\ncontexts.\n","authors":["Yuhan Liu","Hanchen Li","Kuntai Du","Jiayi Yao","Yihua Cheng","Yuyang Huang","Shan Lu","Michael Maire","Henry Hoffmann","Ari Holtzman","Ganesh Ananthanarayanan","Junchen Jiang"],"pdf_url":"https://arxiv.org/pdf/2310.07240v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07235v1","updated":"2023-10-11T06:53:05Z","published":"2023-10-11T06:53:05Z","title":"Are GATs Out of Balance?","summary":" While the expressive power and computational capabilities of graph neural\nnetworks (GNNs) have been theoretically studied, their optimization and\nlearning dynamics, in general, remain largely unexplored. Our study undertakes\nthe Graph Attention Network (GAT), a popular GNN architecture in which a node's\nneighborhood aggregation is weighted by parameterized attention coefficients.\nWe derive a conservation law of GAT gradient flow dynamics, which explains why\na high portion of parameters in GATs with standard initialization struggle to\nchange during training. This effect is amplified in deeper GATs, which perform\nsignificantly worse than their shallow counterparts. To alleviate this problem,\nwe devise an initialization scheme that balances the GAT network. Our approach\ni) allows more effective propagation of gradients and in turn enables\ntrainability of deeper networks, and ii) attains a considerable speedup in\ntraining and convergence time in comparison to the standard initialization. Our\nmain theorem serves as a stepping stone to studying the learning dynamics of\npositive homogeneous models with attention mechanisms.\n","authors":["Nimrah Mustafa","Aleksandar Bojchevski","Rebekka Burkholz"],"pdf_url":"https://arxiv.org/pdf/2310.07235v1.pdf","comment":"24 pages. To be published in Advances in Neural Information\n Processing Systems (NeurIPS), 2023"},{"id":"http://arxiv.org/abs/2310.07234v1","updated":"2023-10-11T06:51:46Z","published":"2023-10-11T06:51:46Z","title":"Hierarchical Decomposition of Prompt-Based Continual Learning:\n Rethinking Obscured Sub-optimality","summary":" Prompt-based continual learning is an emerging direction in leveraging\npre-trained knowledge for downstream continual learning, and has almost reached\nthe performance pinnacle under supervised pre-training. However, our empirical\nresearch reveals that the current strategies fall short of their full potential\nunder the more realistic self-supervised pre-training, which is essential for\nhandling vast quantities of unlabeled data in practice. 
This is largely due to\nthe difficulty of task-specific knowledge being incorporated into instructed\nrepresentations via prompt parameters and predicted by uninstructed\nrepresentations at test time. To overcome the exposed sub-optimality, we\nconduct a theoretical analysis of the continual learning objective in the\ncontext of pre-training, and decompose it into hierarchical components:\nwithin-task prediction, task-identity inference, and task-adaptive prediction.\nFollowing these empirical and theoretical insights, we propose Hierarchical\nDecomposition (HiDe-)Prompt, an innovative approach that explicitly optimizes\nthe hierarchical components with an ensemble of task-specific prompts and\nstatistics of both uninstructed and instructed representations, further with\nthe coordination of a contrastive regularization strategy. Our extensive\nexperiments demonstrate the superior performance of HiDe-Prompt and its\nrobustness to pre-training paradigms in continual learning (e.g., up to 15.01%\nand 9.61% lead on Split CIFAR-100 and Split ImageNet-R, respectively). Our code\nis available at \\url{https://github.com/thu-ml/HiDe-Prompt}.\n","authors":["Liyuan Wang","Jingyi Xie","Xingxing Zhang","Mingyi Huang","Hang Su","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.07234v1.pdf","comment":"23 pages, 20 figures, 11 tables, accepted by NeurIPS as a Spotlight"},{"id":"http://arxiv.org/abs/2212.12180v4","updated":"2023-10-11T06:49:08Z","published":"2022-12-23T07:42:56Z","title":"Autothrottle: A Practical Bi-Level Approach to Resource Management for\n SLO-Targeted Microservices","summary":" Achieving resource efficiency while preserving end-user experience is\nnon-trivial for cloud application operators. As cloud applications\nprogressively adopt microservices, resource managers are faced with two\ndistinct levels of system behavior: end-to-end application latency and\nper-service resource usage. Translating between the two levels, however, is\nchallenging because user requests traverse heterogeneous services that\ncollectively (but unevenly) contribute to the end-to-end latency. We present\nAutothrottle, a bi-level resource management framework for microservices with\nlatency SLOs (service-level objectives). It architecturally decouples\napplication SLO feedback from service resource control, and bridges them\nthrough the notion of performance targets. Specifically, an application-wide\nlearning-based controller is employed to periodically set performance targets\n-- expressed as CPU throttle ratios -- for per-service heuristic controllers to\nattain. We evaluate Autothrottle on three microservice applications, with\nworkload traces from production scenarios. Results show superior CPU savings,\nup to 26.21% over the best-performing baseline and up to 93.84% over all\nbaselines.\n","authors":["Zibo Wang","Pinghe Li","Chieh-Jan Mike Liang","Feng Wu","Francis Y. Yan"],"pdf_url":"https://arxiv.org/pdf/2212.12180v4.pdf","comment":"Accepted by USENIX NSDI '24"},{"id":"http://arxiv.org/abs/2310.07229v1","updated":"2023-10-11T06:36:23Z","published":"2023-10-11T06:36:23Z","title":"Self-supervised Pocket Pretraining via Protein Fragment-Surroundings\n Alignment","summary":" Pocket representations play a vital role in various biomedical applications,\nsuch as druggability estimation, ligand affinity prediction, and de novo drug\ndesign. 
While existing geometric features and pretrained representations have\ndemonstrated promising results, they usually treat pockets independent of\nligands, neglecting the fundamental interactions between them. However, the\nlimited pocket-ligand complex structures available in the PDB database (less\nthan 100 thousand non-redundant pairs) hampers large-scale pretraining\nendeavors for interaction modeling. To address this constraint, we propose a\nnovel pocket pretraining approach that leverages knowledge from high-resolution\natomic protein structures, assisted by highly effective pretrained small\nmolecule representations. By segmenting protein structures into drug-like\nfragments and their corresponding pockets, we obtain a reasonable simulation of\nligand-receptor interactions, resulting in the generation of over 5 million\ncomplexes. Subsequently, the pocket encoder is trained in a contrastive manner\nto align with the representation of pseudo-ligand furnished by some pretrained\nsmall molecule encoders. Our method, named ProFSA, achieves state-of-the-art\nperformance across various tasks, including pocket druggability prediction,\npocket matching, and ligand binding affinity prediction. Notably, ProFSA\nsurpasses other pretraining methods by a substantial margin. Moreover, our work\nopens up a new avenue for mitigating the scarcity of protein-ligand complex\ndata through the utilization of high-quality and diverse protein structure\ndatabases.\n","authors":["Bowen Gao","Yinjun Jia","Yuanle Mo","Yuyan Ni","Weiying Ma","Zhiming Ma","Yanyan Lan"],"pdf_url":"https://arxiv.org/pdf/2310.07229v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.12421v3","updated":"2023-10-11T06:32:32Z","published":"2022-11-11T02:14:28Z","title":"Data-Driven Network Neuroscience: On Data Collection and Benchmark","summary":" This paper presents a comprehensive and quality collection of functional\nhuman brain \\emph{network} data for potential research in the intersection of\nneuroscience, machine learning, and graph analytics. Anatomical and functional\nMRI images have been used to understand the functional connectivity of the\nhuman brain and are particularly important in identifying underlying\nneurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism.\nRecently, the study of the brain in the form of brain networks using machine\nlearning and graph analytics has become increasingly popular, especially to\npredict the early onset of these conditions. A brain network, represented as a\ngraph, retains rich structural and positional information that traditional\nexamination methods are unable to capture. However, the lack of publicly\naccessible brain network data prevents researchers from data-driven\nexplorations. One of the main difficulties lies in the complicated\ndomain-specific preprocessing steps and the exhaustive computation required to\nconvert the data from MRI images into brain networks. We bridge this gap by\ncollecting a large amount of MRI images from public databases and a private\nsource, working with domain experts to make sensible design choices, and\npreprocessing the MRI images to produce a collection of brain network datasets.\nThe datasets originate from 6 different sources, cover 4 brain conditions, and\nconsist of a total of 2,702 subjects. We test our graph datasets on 12 machine\nlearning models to provide baselines and validate the data quality on a recent\ngraph analysis model. 
To lower the barrier to entry and promote the research in\nthis interdisciplinary field, we release our brain network data and complete\npreprocessing details including codes at\nhttps://doi.org/10.17608/k6.auckland.21397377 and\nhttps://figshare.com/s/fa33c10664ca08b022ce.\n","authors":["Jiaxing Xu","Yunhan Yang","David Tse Jung Huang","Sophi Shilpa Gururajapathy","Yiping Ke","Miao Qiao","Alan Wang","Haribalan Kumar","Josh McGeown","Eryn Kwon"],"pdf_url":"https://arxiv.org/pdf/2211.12421v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.17382v2","updated":"2023-10-11T06:18:04Z","published":"2023-09-29T16:36:39Z","title":"Reason for Future, Act for Now: A Principled Framework for Autonomous\n LLM Agents with Provable Sample Efficiency","summary":" Large language models (LLMs) demonstrate impressive reasoning abilities, but\ntranslating reasoning into actions in the real world remains challenging. In\nparticular, it remains unclear how to complete a given task provably within a\nminimum number of interactions with the external environment, e.g., through an\ninternal mechanism of reasoning. To this end, we propose a principled framework\nwith provable regret guarantees to orchestrate reasoning and acting, which we\ncall \"reason for future, act for now\" (\\texttt{RAFA}). Specifically, we design\na prompt template for reasoning that learns from the memory buffer and plans a\nfuture trajectory over a long horizon (\"reason for future\"). At each step, the\nLLM agent takes the initial action of the planned trajectory (\"act for now\"),\nstores the collected feedback in the memory buffer, and reinvokes the reasoning\nroutine to replan the future trajectory from the new state.\n The key idea is to cast reasoning in LLMs as learning and planning in\nBayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt\nLLMs to form an updated posterior of the unknown environment from the memory\nbuffer (learning) and generate an optimal trajectory for multiple future steps\nthat maximizes a value function (planning). The learning and planning\nsubroutines are performed in an \"in-context\" manner to emulate the actor-critic\nupdate for MDPs. Our theoretical analysis proves that the novel combination of\nlong-term reasoning and short-term acting achieves a $\\sqrt{T}$ regret. In\nparticular, the regret bound highlights an intriguing interplay between the\nprior knowledge obtained through pretraining and the uncertainty reduction\nachieved by reasoning and acting. Our empirical validation shows that it\noutperforms various existing frameworks and achieves nearly perfect scores on a\nfew benchmarks.\n","authors":["Zhihan Liu","Hao Hu","Shenao Zhang","Hongyi Guo","Shuqi Ke","Boyi Liu","Zhaoran Wang"],"pdf_url":"https://arxiv.org/pdf/2309.17382v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07223v1","updated":"2023-10-11T06:13:50Z","published":"2023-10-11T06:13:50Z","title":"Deep Learning for blind spectral unmixing of LULC classes with MODIS\n multispectral time series and ancillary data","summary":" Remotely sensed data are dominated by mixed Land Use and Land Cover (LULC)\ntypes. 
Spectral unmixing is a technique to extract information from mixed\npixels into their constituent LULC types and corresponding abundance fractions.\nTraditionally, solving this task has relied on either classical methods that\nrequire prior knowledge of endmembers or machine learning methods that avoid\nexplicit endmembers calculation, also known as blind spectral unmixing (BSU).\nMost BSU studies based on Deep Learning (DL) focus on one time-step\nhyperspectral data, yet its acquisition remains quite costly compared with\nmultispectral data. To our knowledge, here we provide the first study on BSU of\nLULC classes using multispectral time series data with DL models. We further\nboost the performance of a Long-Short Term Memory (LSTM)-based model by\nincorporating geographic plus topographic (geo-topographic) and climatic\nancillary information. Our experiments show that combining spectral-temporal\ninput data together with geo-topographic and climatic information substantially\nimproves the abundance estimation of LULC classes in mixed pixels. To carry out\nthis study, we built a new labeled dataset of the region of Andalusia (Spain)\nwith monthly multispectral time series of pixels for the year 2013 from MODIS\nat 460m resolution, for two hierarchical levels of LULC classes, named\nAndalusia MultiSpectral MultiTemporal Unmixing (Andalusia-MSMTU). This dataset\nprovides, at the pixel level, a multispectral time series plus ancillary\ninformation annotated with the abundance of each LULC class inside each pixel.\nThe dataset and code are available to the public.\n","authors":["José Rodríguez-Ortega","Rohaifa Khaldi","Domingo Alcaraz-Segura","Siham Tabik"],"pdf_url":"https://arxiv.org/pdf/2310.07223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07221v1","updated":"2023-10-11T06:11:11Z","published":"2023-10-11T06:11:11Z","title":"Using Learnable Physics for Real-Time Exercise Form Recommendations","summary":" Good posture and form are essential for safe and productive exercising. Even\nin gym settings, trainers may not be readily available for feedback.\nRehabilitation therapies and fitness workouts can thus benefit from recommender\nsystems that provide real-time evaluation. In this paper, we present an\nalgorithmic pipeline that can diagnose problems in exercise techniques and\noffer corrective recommendations, with high sensitivity and specificity in\nreal-time. We use MediaPipe for pose recognition, count repetitions using\npeak-prominence detection, and use a learnable physics simulator to track\nmotion evolution for each exercise. A test video is diagnosed based on\ndeviations from the prototypical learned motion using statistical learning. The\nsystem is evaluated on six full and upper body exercises. 
These real-time\nrecommendations, counseled via low-cost equipment like smartphones, will allow\nexercisers to rectify potential mistakes making self-practice feasible while\nreducing the risk of workout injuries.\n","authors":["Abhishek Jaiswal","Gautam Chauhan","Nisheeth Srivastava"],"pdf_url":"https://arxiv.org/pdf/2310.07221v1.pdf","comment":"Accepted by ACM RecSys '23, 12 pages , 7 Figures"},{"id":"http://arxiv.org/abs/2310.07220v1","updated":"2023-10-11T06:10:07Z","published":"2023-10-11T06:10:07Z","title":"COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically\n for Model-Based RL","summary":" Dyna-style model-based reinforcement learning contains two phases: model\nrollouts to generate sample for policy learning and real environment\nexploration using current policy for dynamics model learning. However, due to\nthe complex real-world environment, it is inevitable to learn an imperfect\ndynamics model with model prediction error, which can further mislead policy\nlearning and result in sub-optimal solutions. In this paper, we propose\n$\\texttt{COPlanner}$, a planning-driven framework for model-based methods to\naddress the inaccurately learned dynamics model problem with conservative model\nrollouts and optimistic environment exploration. $\\texttt{COPlanner}$ leverages\nan uncertainty-aware policy-guided model predictive control (UP-MPC) component\nto plan for multi-step uncertainty estimation. This estimated uncertainty then\nserves as a penalty during model rollouts and as a bonus during real\nenvironment exploration respectively, to choose actions. Consequently,\n$\\texttt{COPlanner}$ can avoid model uncertain regions through conservative\nmodel rollouts, thereby alleviating the influence of model error.\nSimultaneously, it explores high-reward model uncertain regions to reduce model\nerror actively through optimistic real environment exploration.\n$\\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any\ndyna-style model-based methods. Experimental results on a series of\nproprioceptive and visual continuous control tasks demonstrate that both sample\nefficiency and asymptotic performance of strong model-based methods are\nsignificantly improved combined with $\\texttt{COPlanner}$.\n","authors":["Xiyao Wang","Ruijie Zheng","Yanchao Sun","Ruonan Jia","Wichayaporn Wongkamjan","Huazhe Xu","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2310.07220v1.pdf","comment":"20 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.07219v1","updated":"2023-10-11T06:09:48Z","published":"2023-10-11T06:09:48Z","title":"Improved Membership Inference Attacks Against Language Classification\n Models","summary":" Artificial intelligence systems are prevalent in everyday life, with use\ncases in retail, manufacturing, health, and many other fields. With the rise in\nAI adoption, associated risks have been identified, including privacy risks to\nthe people whose data was used to train models. Assessing the privacy risks of\nmachine learning models is crucial to enabling knowledgeable decisions on\nwhether to use, deploy, or share a model. A common approach to privacy risk\nassessment is to run one or more known attacks against the model and measure\ntheir success rate. We present a novel framework for running membership\ninference attacks against classification models. Our framework takes advantage\nof the ensemble method, generating many specialized attack models for different\nsubsets of the data. 
We show that this approach achieves higher accuracy than\neither a single attack model or an attack model per class label, both on\nclassical and language classification tasks.\n","authors":["Shlomit Shachor","Natalia Razinkov","Abigail Goldsteen"],"pdf_url":"https://arxiv.org/pdf/2310.07219v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07217v1","updated":"2023-10-11T06:09:14Z","published":"2023-10-11T06:09:14Z","title":"Enhancing Neural Architecture Search with Multiple Hardware Constraints\n for Deep Learning Model Deployment on Tiny IoT Devices","summary":" The rapid proliferation of computing domains relying on Internet of Things\n(IoT) devices has created a pressing need for efficient and accurate\ndeep-learning (DL) models that can run on low-power devices. However,\ntraditional DL models tend to be too complex and computationally intensive for\ntypical IoT end-nodes. To address this challenge, Neural Architecture Search\n(NAS) has emerged as a popular design automation technique for co-optimizing\nthe accuracy and complexity of deep neural networks. Nevertheless, existing NAS\ntechniques require many iterations to produce a network that adheres to\nspecific hardware constraints, such as the maximum memory available on the\nhardware or the maximum latency allowed by the target application. In this\nwork, we propose a novel approach to incorporate multiple constraints into\nso-called Differentiable NAS optimization methods, which allows the generation,\nin a single shot, of a model that respects user-defined constraints on both\nmemory and latency in a time comparable to a single standard training. The\nproposed approach is evaluated on five IoT-relevant benchmarks, including the\nMLPerf Tiny suite and Tiny ImageNet, demonstrating that, with a single search,\nit is possible to reduce memory and latency by 87.4% and 54.2%, respectively\n(as defined by our targets), while ensuring non-inferior accuracy on\nstate-of-the-art hand-tuned deep neural networks for TinyML.\n","authors":["Alessio Burrello","Matteo Risso","Beatrice Alessandra Motetti","Enrico Macii","Luca Benini","Daniele Jahier Pagliari"],"pdf_url":"https://arxiv.org/pdf/2310.07217v1.pdf","comment":"Accepted for publication at the IEEE Transactions on Emerging Topics\n in Computing"},{"id":"http://arxiv.org/abs/2310.07216v1","updated":"2023-10-11T06:04:40Z","published":"2023-10-11T06:04:40Z","title":"Generative Modeling on Manifolds Through Mixture of Riemannian Diffusion\n Processes","summary":" Learning the distribution of data on Riemannian manifolds is crucial for\nmodeling data from non-Euclidean space, which is required by many applications\nfrom diverse scientific fields. Yet, existing generative models on manifolds\nsuffer from expensive divergence computation or rely on approximations of heat\nkernel. These limitations restrict their applicability to simple geometries and\nhinder scalability to high dimensions. In this work, we introduce the\nRiemannian Diffusion Mixture, a principled framework for building a generative\nprocess on manifolds as a mixture of endpoint-conditioned diffusion processes\ninstead of relying on the denoising approach of previous diffusion models, for\nwhich the generative process is characterized by its drift guiding toward the\nmost probable endpoint with respect to the geometry of the manifold. We further\npropose a simple yet efficient training objective for learning the mixture\nprocess, that is readily applicable to general manifolds. 
Our method\noutperforms previous generative models on various manifolds while scaling to\nhigh dimensions and requires a dramatically reduced number of in-training\nsimulation steps for general manifolds.\n","authors":["Jaehyeong Jo","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2310.07216v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04895v2","updated":"2023-10-11T05:59:53Z","published":"2023-10-07T18:47:17Z","title":"Cell Tracking-by-detection using Elliptical Bounding Boxes","summary":" Cell detection and tracking are paramount for bio-analysis. Recent approaches\nrely on the tracking-by-model evolution paradigm, which usually consists of\ntraining end-to-end deep learning models to detect and track the cells on the\nframes with promising results. However, such methods require extensive amounts\nof annotated data, which is time-consuming to obtain and often requires\nspecialized annotators. This work proposes a new approach based on the\nclassical tracking-by-detection paradigm that alleviates the requirement of\nannotated data. More precisely, it approximates the cell shapes as oriented\nellipses and then uses generic-purpose oriented object detectors to identify\nthe cells in each frame. We then rely on a global data association algorithm\nthat explores temporal cell similarity using probability distance metrics,\nconsidering that the ellipses relate to two-dimensional Gaussian distributions.\nOur results show that our method can achieve detection and tracking results\ncompetitively with state-of-the-art techniques that require considerably more\nextensive data annotation. Our code is available at:\nhttps://github.com/LucasKirsten/Deep-Cell-Tracking-EBB.\n","authors":["Lucas N. Kirsten","Cláudio R. Jung"],"pdf_url":"https://arxiv.org/pdf/2310.04895v2.pdf","comment":"Paper under review on IEEE/ACM Transactions on Computational Biology\n and Bioinformatics"},{"id":"http://arxiv.org/abs/2310.07211v1","updated":"2023-10-11T05:55:20Z","published":"2023-10-11T05:55:20Z","title":"Bridging the Gap between Newton-Raphson Method and Regularized Policy\n Iteration","summary":" Regularization is one of the most important techniques in reinforcement\nlearning algorithms. The well-known soft actor-critic algorithm is a special\ncase of regularized policy iteration where the regularizer is chosen as Shannon\nentropy. Despite some empirical success of regularized policy iteration, its\ntheoretical underpinnings remain unclear. This paper proves that regularized\npolicy iteration is strictly equivalent to the standard Newton-Raphson method\nin the condition of smoothing out Bellman equation with strongly convex\nfunctions. This equivalence lays the foundation of a unified analysis for both\nglobal and local convergence behaviors of regularized policy iteration. We\nprove that regularized policy iteration has global linear convergence with the\nrate being $\\gamma$ (discount factor). Furthermore, this algorithm converges\nquadratically once it enters a local region around the optimal value. We also\nshow that a modified version of regularized policy iteration, i.e., with\nfinite-step policy evaluation, is equivalent to inexact Newton method where the\nNewton iteration formula is solved with truncated iterations. We prove that the\nassociated algorithm achieves an asymptotic linear convergence rate of\n$\\gamma^M$ in which $M$ denotes the number of steps carried out in policy\nevaluation. 
Our results take a solid step towards a better understanding of the\nconvergence properties of regularized policy iteration algorithms.\n","authors":["Zeyang Li","Chuxiong Hu","Yunan Wang","Guojian Zhan","Jie Li","Shengbo Eben Li"],"pdf_url":"https://arxiv.org/pdf/2310.07211v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07207v1","updated":"2023-10-11T05:34:46Z","published":"2023-10-11T05:34:46Z","title":"Robust Safe Reinforcement Learning under Adversarial Disturbances","summary":" Safety is a primary concern when applying reinforcement learning to\nreal-world control tasks, especially in the presence of external disturbances.\nHowever, existing safe reinforcement learning algorithms rarely account for\nexternal disturbances, limiting their applicability and robustness in practice.\nTo address this challenge, this paper proposes a robust safe reinforcement\nlearning framework that tackles worst-case disturbances. First, this paper\npresents a policy iteration scheme to solve for the robust invariant set, i.e.,\na subset of the safe set, where persistent safety is only possible for states\nwithin. The key idea is to establish a two-player zero-sum game by leveraging\nthe safety value function in Hamilton-Jacobi reachability analysis, in which\nthe protagonist (i.e., control inputs) aims to maintain safety and the\nadversary (i.e., external disturbances) tries to break down safety. This paper\nproves that the proposed policy iteration algorithm converges monotonically to\nthe maximal robust invariant set. Second, this paper integrates the proposed\npolicy iteration scheme into a constrained reinforcement learning algorithm\nthat simultaneously synthesizes the robust invariant set and uses it for\nconstrained policy optimization. This algorithm tackles both optimality and\nsafety, i.e., learning a policy that attains high rewards while maintaining\nsafety under worst-case disturbances. Experiments on classic control tasks show\nthat the proposed method achieves zero constraint violation with learned\nworst-case adversarial disturbances, while other baseline algorithms violate\nthe safety constraints substantially. Our proposed method also attains\ncomparable performance as the baselines even in the absence of the adversary.\n","authors":["Zeyang Li","Chuxiong Hu","Shengbo Eben Li","Jia Cheng","Yunan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07207v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07204v1","updated":"2023-10-11T05:32:29Z","published":"2023-10-11T05:32:29Z","title":"State of the Art on Diffusion Models for Visual Computing","summary":" The field of visual computing is rapidly advancing due to the emergence of\ngenerative artificial intelligence (AI), which unlocks unprecedented\ncapabilities for the generation, editing, and reconstruction of images, videos,\nand 3D scenes. In these domains, diffusion models are the generative AI\narchitecture of choice. Within the last year alone, the literature on\ndiffusion-based tools and applications has seen exponential growth and relevant\npapers are published across the computer graphics, computer vision, and AI\ncommunities with new works appearing daily on arXiv. This rapid growth of the\nfield makes it difficult to keep up with all recent developments. 
The goal of\nthis state-of-the-art report (STAR) is to introduce the basic mathematical\nconcepts of diffusion models, implementation details and design choices of the\npopular Stable Diffusion model, as well as overview important aspects of these\ngenerative AI tools, including personalization, conditioning, inversion, among\nothers. Moreover, we give a comprehensive overview of the rapidly growing\nliterature on diffusion-based generation and editing, categorized by the type\nof generated medium, including 2D images, videos, 3D objects, locomotion, and\n4D scenes. Finally, we discuss available datasets, metrics, open challenges,\nand social implications. This STAR provides an intuitive starting point to\nexplore this exciting topic for researchers, artists, and practitioners alike.\n","authors":["Ryan Po","Wang Yifan","Vladislav Golyanik","Kfir Aberman","Jonathan T. Barron","Amit H. Bermano","Eric Ryan Chan","Tali Dekel","Aleksander Holynski","Angjoo Kanazawa","C. Karen Liu","Lingjie Liu","Ben Mildenhall","Matthias Nießner","Björn Ommer","Christian Theobalt","Peter Wonka","Gordon Wetzstein"],"pdf_url":"https://arxiv.org/pdf/2310.07204v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2310.07517v1","updated":"2023-10-11T14:15:25Z","published":"2023-10-11T14:15:25Z","title":"CM-PIE: Cross-modal perception for interactive-enhanced audio-visual\n video parsing","summary":" Audio-visual video parsing is the task of categorizing a video at the segment\nlevel with weak labels, and predicting them as audible or visible events.\nRecent methods for this task leverage the attention mechanism to capture the\nsemantic correlations among the whole video across the audio-visual modalities.\nHowever, these approaches have overlooked the importance of individual segments\nwithin a video and the relationship among them, and tend to rely on a single\nmodality when learning features. In this paper, we propose a novel\ninteractive-enhanced cross-modal perception method~(CM-PIE), which can learn\nfine-grained features by applying a segment-based attention module.\nFurthermore, a cross-modal aggregation block is introduced to jointly optimize\nthe semantic representation of audio and visual signals by enhancing\ninter-modal interactions. The experimental results show that our model offers\nimproved parsing performance on the Look, Listen, and Parse dataset compared to\nother methods.\n","authors":["Yaru Chen","Ruohao Guo","Xubo Liu","Peipei Wu","Guangyao Li","Zhenbo Li","Wenwu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07517v1.pdf","comment":"5 pages, 3 figures, 15 references"},{"id":"http://arxiv.org/abs/2304.11161v2","updated":"2023-10-11T13:29:23Z","published":"2023-04-02T16:03:44Z","title":"altiro3D: Scene representation from single image and novel view\n synthesis","summary":" We introduce altiro3D, a free extended library developed to represent reality\nstarting from a given original RGB image or flat video. It allows to generate a\nlight-field (or Native) image or video and get a realistic 3D experience. To\nsynthesize N-number of virtual images and add them sequentially into a Quilt\ncollage, we apply MiDaS models for the monocular depth estimation, simple\nOpenCV and Telea inpainting techniques to map all pixels, and implement a\n'Fast' algorithm to handle 3D projection camera and scene transformations along\nN-viewpoints. We use the degree of depth to move proportionally the pixels,\nassuming the original image to be at the center of all the viewpoints. 
altiro3D\ncan also be used with DIBR algorithm to compute intermediate snapshots from a\nequivalent 'Real (slower)' camera with N-geometric viewpoints, which requires\nto calibrate a priori several intrinsic and extrinsic camera parameters. We\nadopt a pixel- and device-based Lookup Table to optimize computing time. The\nmultiple viewpoints and video generated from a single image or frame can be\ndisplayed in a free-view LCD display.\n","authors":["E. Canessa","L. Tenze"],"pdf_url":"https://arxiv.org/pdf/2304.11161v2.pdf","comment":"In press (2023) Springer International Journal of Information\n Technology (IJIT) 10 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.07376v1","updated":"2023-10-11T10:50:15Z","published":"2023-10-11T10:50:15Z","title":"Point Cloud Denoising and Outlier Detection with Local Geometric\n Structure by Dynamic Graph CNN","summary":" The digitalization of society is rapidly developing toward the realization of\nthe digital twin and metaverse. In particular, point clouds are attracting\nattention as a media format for 3D space. Point cloud data is contaminated with\nnoise and outliers due to measurement errors. Therefore, denoising and outlier\ndetection are necessary for point cloud processing. Among them, PointCleanNet\nis an effective method for point cloud denoising and outlier detection.\nHowever, it does not consider the local geometric structure of the patch. We\nsolve this problem by applying two types of graph convolutional layer designed\nbased on the Dynamic Graph CNN. Experimental results show that the proposed\nmethods outperform the conventional method in AUPR, which indicates outlier\ndetection accuracy, and Chamfer Distance, which indicates denoising accuracy.\n","authors":["Kosuke Nakayama","Hiroto Fukuta","Hiroshi Watanabe"],"pdf_url":"https://arxiv.org/pdf/2310.07376v1.pdf","comment":"2023 IEEE 12th Global Conference on Consumer Electronics (GCCE 2023)"},{"id":"http://arxiv.org/abs/2310.07287v1","updated":"2023-10-11T08:18:51Z","published":"2023-10-11T08:18:51Z","title":"Interactive Interior Design Recommendation via Coarse-to-fine Multimodal\n Reinforcement Learning","summary":" Personalized interior decoration design often incurs high labor costs. Recent\nefforts in developing intelligent interior design systems have focused on\ngenerating textual requirement-based decoration designs while neglecting the\nproblem of how to mine homeowner's hidden preferences and choose the proper\ninitial design. To fill this gap, we propose an Interactive Interior Design\nRecommendation System (IIDRS) based on reinforcement learning (RL). IIDRS aims\nto find an ideal plan by interacting with the user, who provides feedback on\nthe gap between the recommended plan and their ideal one. To improve\ndecision-making efficiency and effectiveness in large decoration spaces, we\npropose a Decoration Recommendation Coarse-to-Fine Policy Network (DecorRCFN).\nAdditionally, to enhance generalization in online scenarios, we propose an\nobject-aware feedback generation method that augments model training with\ndiversified and dynamic textual feedback. Extensive experiments on a real-world\ndataset demonstrate our method outperforms traditional methods by a large\nmargin in terms of recommendation accuracy. 
Further user studies demonstrate\nthat our method reaches higher real-world user satisfaction than baseline\nmethods.\n","authors":["He Zhang","Ying Sun","Weiyu Guo","Yafei Liu","Haonan Lu","Xiaodong Lin","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2310.07287v1.pdf","comment":"Accepted by ACM International Conference on Multimedia'23. 9 pages, 7\n figures"},{"id":"http://arxiv.org/abs/2310.07236v1","updated":"2023-10-11T06:56:08Z","published":"2023-10-11T06:56:08Z","title":"AdaMesh: Personalized Facial Expressions and Head Poses for\n Speech-Driven 3D Facial Animation","summary":" Speech-driven 3D facial animation aims at generating facial movements that\nare synchronized with the driving speech, which has been widely explored\nrecently. Existing works mostly neglect the person-specific talking style in\ngeneration, including facial expression and head pose styles. Several works\nintend to capture the personalities by fine-tuning modules. However, limited\ntraining data leads to a lack of vividness. In this work, we propose AdaMesh,\na novel adaptive speech-driven facial animation approach, which learns the\npersonalized talking style from a reference video of about 10 seconds and\ngenerates vivid facial expressions and head poses. Specifically, we propose\nmixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter,\nwhich efficiently captures the facial expression style. For the personalized\npose style, we propose a pose adapter by building a discrete pose prior and\nretrieving the appropriate style embedding with a semantic-aware pose style\nmatrix without fine-tuning. Extensive experimental results show that our\napproach outperforms state-of-the-art methods, preserves the talking style in\nthe reference video, and generates vivid facial animation. The supplementary\nvideo and code will be available at https://adamesh.github.io.\n","authors":["Liyang Chen","Weihong Bao","Shun Lei","Boshi Tang","Zhiyong Wu","Shiyin Kang","Haozhi Huang"],"pdf_url":"https://arxiv.org/pdf/2310.07236v1.pdf","comment":"Project Page: https://adamesh.github.io"},{"id":"http://arxiv.org/abs/2310.04673v3","updated":"2023-10-11T02:55:54Z","published":"2023-10-07T03:17:59Z","title":"LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT","summary":" Generative Pre-trained Transformer (GPT) models have achieved remarkable\nperformance on various natural language processing tasks. However, there has\nbeen limited research on applying similar frameworks to audio tasks. Previously\nproposed large language models for audio tasks either lack sufficient\nquantitative evaluations, or are limited to tasks for recognizing and\nunderstanding audio content, or significantly underperform existing\nstate-of-the-art (SOTA) models. In this paper, we propose LauraGPT, a unified\nGPT model for audio recognition, understanding, and generation. LauraGPT is a\nversatile language model that can process both audio and text inputs and\ngenerate outputs in either modality. It can perform a wide range of tasks\nrelated to content, semantics, paralinguistics, and audio-signal analysis. Some\nof its noteworthy tasks include automatic speech recognition, speech-to-text\ntranslation, text-to-speech synthesis, machine translation, speech enhancement,\nautomated audio captioning, speech emotion recognition, and spoken language\nunderstanding. To achieve this goal, we use a combination of continuous and\ndiscrete features for audio. 
We encode input audio into continuous\nrepresentations using an audio encoder and decode output audio from discrete\ncodec codes. We then fine-tune a large decoder-only Transformer-based language\nmodel on multiple audio-to-text, text-to-audio, audio-to-audio, and\ntext-to-text tasks using a supervised multitask learning approach. Extensive\nexperiments show that LauraGPT achieves competitive or superior performance\ncompared to existing SOTA models on various audio processing benchmarks.\n","authors":["Jiaming Wang","Zhihao Du","Qian Chen","Yunfei Chu","Zhifu Gao","Zerui Li","Kai Hu","Xiaohuan Zhou","Jin Xu","Ziyang Ma","Wen Wang","Siqi Zheng","Chang Zhou","Zhijie Yan","Shiliang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.04673v3.pdf","comment":"10 pages, under review"},{"id":"http://arxiv.org/abs/2310.07121v1","updated":"2023-10-11T01:51:19Z","published":"2023-10-11T01:51:19Z","title":"Motion Vector-Domain Video Steganalysis Exploiting Skipped Macroblocks","summary":" Video steganography has the potential to be used to convey illegal\ninformation, and video steganalysis is a vital tool to detect the presence of\nthis illicit act. Currently, all the motion vector (MV)-based video\nsteganalysis algorithms extract feature sets directly on the MVs, but ignore\nthat the steganographic operation may perturb the statistical distribution of other\nvideo encoding elements, such as the skipped macroblocks (no direct MVs). This\npaper proposes a novel 11-dimensional feature set to detect MV-based video\nsteganography based on the above observation. The proposed feature is extracted\nbased on the skipped macroblocks by recompression calibration. Specifically,\nthe feature consists of two components. The first is the probability\ndistribution of motion vector prediction (MVP) difference, and the second is\nthe probability distribution of partition state transfer. Extensive experiments\non different conditions demonstrate that the proposed feature set achieves good\ndetection accuracy, especially at lower embedding capacities. In addition, the\nloss of detection performance caused by recompression calibration using\nmismatched quantization parameters (QP) is within the acceptable range, so the\nproposed method can be used in practical scenarios.\n","authors":["Jun Li","Minqing Zhang","Ke Niu","Yingnan Zhang","Xiaoyuan Yang"],"pdf_url":"https://arxiv.org/pdf/2310.07121v1.pdf","comment":null}]},"2023-10-12T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.08582v1","updated":"2023-10-12T17:59:50Z","published":"2023-10-12T17:59:50Z","title":"Tree-Planner: Efficient Close-loop Task Planning with Large Language\n Models","summary":" This paper studies close-loop task planning, which refers to the process of\ngenerating a sequence of skills (a plan) to accomplish a specific goal while\nadapting the plan based on real-time observations. Recently, prompting Large\nLanguage Models (LLMs) to generate actions iteratively has become a prevalent\nparadigm due to its superior performance and user-friendliness. However, this\nparadigm is plagued by two inefficiencies: high token consumption and redundant\nerror correction, both of which hinder its scalability for large-scale testing\nand applications. To address these issues, we propose Tree-Planner, which\nreframes task planning with LLMs into three distinct phases: plan sampling,\naction tree construction, and grounded deciding. 
Tree-Planner starts by using\nan LLM to sample a set of potential plans before execution, followed by the\naggregation of them to form an action tree. Finally, the LLM performs a\ntop-down decision-making process on the tree, taking into account real-time\nenvironmental information. Experiments show that Tree-Planner achieves\nstate-of-the-art performance while maintaining high efficiency. By decomposing\nLLM queries into a single plan-sampling call and multiple grounded-deciding\ncalls, a considerable part of the prompt are less likely to be repeatedly\nconsumed. As a result, token consumption is reduced by 92.2% compared to the\npreviously best-performing model. Additionally, by enabling backtracking on the\naction tree as needed, the correction process becomes more flexible, leading to\na 40.5% decrease in error corrections. Project page:\nhttps://tree-planner.github.io/\n","authors":["Mengkang Hu","Yao Mu","Xinmiao Yu","Mingyu Ding","Shiguang Wu","Wenqi Shao","Qiguang Chen","Bin Wang","Yu Qiao","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2310.08582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08577v1","updated":"2023-10-12T17:59:30Z","published":"2023-10-12T17:59:30Z","title":"Visual Data-Type Understanding does not emerge from Scaling\n Vision-Language Models","summary":" Recent advances in the development of vision-language models (VLMs) are\nyielding remarkable success in recognizing visual semantic content, including\nimpressive instances of compositional image understanding. Here, we introduce\nthe novel task of \\textit{Visual Data-Type Identification}, a basic perceptual\nskill with implications for data curation (e.g., noisy data-removal from large\ndatasets, domain-specific retrieval) and autonomous vision (e.g.,\ndistinguishing changing weather conditions from camera lens staining). We\ndevelop two datasets consisting of animal images altered across a diverse set\nof 27 visual \\textit{data-types}, spanning four broad categories. An extensive\nzero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a\nnuanced performance landscape. While VLMs are reasonably good at identifying\ncertain stylistic \\textit{data-types}, such as cartoons and sketches, they\nstruggle with simpler \\textit{data-types} arising from basic manipulations like\nimage rotations or additive noise. Our findings reveal that (i) model scaling\nalone yields marginal gains for contrastively-trained models like CLIP, and\n(ii) there is a pronounced drop in performance for the largest\nauto-regressively trained VLMs like OpenFlamingo. This finding points to a\nblind spot in current frontier VLMs: they excel in recognizing semantic content\nbut fail to acquire an understanding of visual \\textit{data-types} through\nscaling. By analyzing the pre-training distributions of these models and\nincorporating \\textit{data-type} information into the captions during\nfine-tuning, we achieve a significant enhancement in performance. By exploring\nthis previously uncharted task, we aim to set the stage for further advancing\nVLMs to equip them with visual data-type understanding. Code and datasets are\nreleased \\href{https://github.com/bethgelab/DataTypeIdentification}{here}.\n","authors":["Vishaal Udandarao","Max F. 
Burg","Samuel Albanie","Matthias Bethge"],"pdf_url":"https://arxiv.org/pdf/2310.08577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08566v1","updated":"2023-10-12T17:55:02Z","published":"2023-10-12T17:55:02Z","title":"Transformers as Decision Makers: Provable In-Context Reinforcement\n Learning via Supervised Pretraining","summary":" Large transformer models pretrained on offline reinforcement learning\ndatasets have demonstrated remarkable in-context reinforcement learning (ICRL)\ncapabilities, where they can make good decisions when prompted with interaction\ntrajectories from unseen environments. However, when and how transformers can\nbe trained to perform ICRL have not been theoretically well-understood. In\nparticular, it is unclear which reinforcement-learning algorithms transformers\ncan perform in context, and how distribution mismatch in offline training data\naffects the learned algorithms. This paper provides a theoretical framework\nthat analyzes supervised pretraining for ICRL. This includes two recently\nproposed training methods -- algorithm distillation and decision-pretrained\ntransformers. First, assuming model realizability, we prove the\nsupervised-pretrained transformer will imitate the conditional expectation of\nthe expert algorithm given the observed trajectory. The generalization error\nwill scale with model capacity and a distribution divergence factor between the\nexpert and offline algorithms. Second, we show transformers with ReLU attention\ncan efficiently approximate near-optimal online reinforcement learning\nalgorithms like LinUCB and Thompson sampling for stochastic linear bandits, and\nUCB-VI for tabular Markov decision processes. This provides the first\nquantitative analysis of the ICRL capabilities of transformers pretrained from\noffline trajectories.\n","authors":["Licong Lin","Yu Bai","Song Mei"],"pdf_url":"https://arxiv.org/pdf/2310.08566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08559v1","updated":"2023-10-12T17:51:10Z","published":"2023-10-12T17:51:10Z","title":"Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of\n Language Models with Hypothesis Refinement","summary":" The ability to derive underlying principles from a handful of observations\nand then generalize to novel situations -- known as inductive reasoning -- is\ncentral to human intelligence. Prior work suggests that language models (LMs)\noften fall short on inductive reasoning, despite achieving impressive success\non research benchmarks. In this work, we conduct a systematic study of the\ninductive reasoning capabilities of LMs through iterative hypothesis\nrefinement, a technique that more closely mirrors the human inductive process\nthan standard input-output prompting. Iterative hypothesis refinement employs a\nthree-step process: proposing, selecting, and refining hypotheses in the form\nof textual rules. By examining the intermediate rules, we observe that LMs are\nphenomenal hypothesis proposers (i.e., generating candidate rules), and when\ncoupled with a (task-specific) symbolic interpreter that is able to\nsystematically filter the proposed set of rules, this hybrid approach achieves\nstrong results across inductive reasoning benchmarks that require inducing\ncausal relations, language-like instructions, and symbolic concepts. 
However,\nthey also behave as puzzling inductive reasoners, showing notable performance\ngaps in rule induction (i.e., identifying plausible rules) and rule application\n(i.e., applying proposed rules to instances), suggesting that LMs are proposing\nhypotheses without being able to actually apply the rules. Through empirical\nand human analyses, we further reveal several discrepancies between the\ninductive reasoning processes of LMs and humans, shedding light on both the\npotentials and limitations of using LMs in inductive reasoning tasks.\n","authors":["Linlu Qiu","Liwei Jiang","Ximing Lu","Melanie Sclar","Valentina Pyatkin","Chandra Bhagavatula","Bailin Wang","Yoon Kim","Yejin Choi","Nouha Dziri","Xiang Ren"],"pdf_url":"https://arxiv.org/pdf/2310.08559v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.11606v2","updated":"2023-10-12T17:50:38Z","published":"2023-08-22T17:53:55Z","title":"StoryBench: A Multifaceted Benchmark for Continuous Story Visualization","summary":" Generating video stories from text prompts is a complex task. In addition to\nhaving high visual quality, videos need to realistically adhere to a sequence\nof text prompts whilst being consistent throughout the frames. Creating a\nbenchmark for video generation requires data annotated over time, which\ncontrasts with the single caption used often in video datasets. To fill this\ngap, we collect comprehensive human annotations on three existing datasets, and\nintroduce StoryBench: a new, challenging multi-task benchmark to reliably\nevaluate forthcoming text-to-video models. Our benchmark includes three video\ngeneration tasks of increasing difficulty: action execution, where the next\naction must be generated starting from a conditioning video; story\ncontinuation, where a sequence of actions must be executed starting from a\nconditioning video; and story generation, where a video must be generated from\nonly text prompts. We evaluate small yet strong text-to-video baselines, and\nshow the benefits of training on story-like data algorithmically generated from\nexisting video captions. Finally, we establish guidelines for human evaluation\nof video stories, and reaffirm the need of better automatic metrics for video\ngeneration. StoryBench aims at encouraging future research efforts in this\nexciting new area.\n","authors":["Emanuele Bugliarello","Hernan Moraldo","Ruben Villegas","Mohammad Babaeizadeh","Mohammad Taghi Saffar","Han Zhang","Dumitru Erhan","Vittorio Ferrari","Pieter-Jan Kindermans","Paul Voigtlaender"],"pdf_url":"https://arxiv.org/pdf/2308.11606v2.pdf","comment":"NeurIPS D&B 2023"},{"id":"http://arxiv.org/abs/2310.00737v2","updated":"2023-10-12T17:39:04Z","published":"2023-10-01T17:25:56Z","title":"GenAI Against Humanity: Nefarious Applications of Generative Artificial\n Intelligence and Large Language Models","summary":" Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs)\nare marvels of technology; celebrated for their prowess in natural language\nprocessing and multimodal content generation, they promise a transformative\nfuture. But as with all powerful tools, they come with their shadows. Picture\nliving in a world where deepfakes are indistinguishable from reality, where\nsynthetic identities orchestrate malicious campaigns, and where targeted\nmisinformation or scams are crafted with unparalleled precision. Welcome to the\ndarker side of GenAI applications. 
This article is not just a journey through\nthe meanders of potential misuse of GenAI and LLMs, but also a call to\nrecognize the urgency of the challenges ahead. As we navigate the seas of\nmisinformation campaigns, malicious content generation, and the eerie creation\nof sophisticated malware, we'll uncover the societal implications that ripple\nthrough the GenAI revolution we are witnessing. From AI-powered botnets on\nsocial media platforms to the unnerving potential of AI to generate fabricated\nidentities, or alibis made of synthetic realities, the stakes have never been\nhigher. The lines between the virtual and the real worlds are blurring, and the\nconsequences of potential GenAI's nefarious applications impact us all. This\narticle serves both as a synthesis of rigorous research presented on the risks\nof GenAI and misuse of LLMs and as a thought-provoking vision of the different\ntypes of harmful GenAI applications we might encounter in the near future, and\nsome ways we can prepare for them.\n","authors":["Emilio Ferrara"],"pdf_url":"https://arxiv.org/pdf/2310.00737v2.pdf","comment":"Submitted to CACM (Viewpoint)"},{"id":"http://arxiv.org/abs/2306.07951v2","updated":"2023-10-12T17:34:12Z","published":"2023-06-13T17:48:27Z","title":"Questioning the Survey Responses of Large Language Models","summary":" As large language models increase in capability, researchers have started to\nconduct surveys of all kinds on these models with varying scientific\nmotivations. In this work, we examine what we can learn from language models'\nsurvey responses on the basis of the well-established American Community Survey\n(ACS) by the U.S. Census Bureau. Using a de-facto standard multiple-choice\nprompting technique and evaluating 40 different language models, hundreds of\nthousands of times each on questions from the ACS, we systematically establish\ntwo dominant patterns. First, models have significant position and labeling\nbiases, for example, towards survey responses labeled with the letter \"A\".\nSecond, when adjusting for labeling biases through randomized answer ordering,\nmodels across the board trend towards uniformly random survey responses. In\nfact, binary classifiers can almost perfectly differentiate between models'\nresponses to the ACS and the responses of the US census. Taken together, our\nfindings suggest caution in treating survey responses from language models as\nequivalent to those of human populations at present time.\n","authors":["Ricardo Dominguez-Olmedo","Moritz Hardt","Celestine Mendler-Dünner"],"pdf_url":"https://arxiv.org/pdf/2306.07951v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08540v1","updated":"2023-10-12T17:32:09Z","published":"2023-10-12T17:32:09Z","title":"Do pretrained Transformers Really Learn In-context by Gradient Descent?","summary":" Is In-Context Learning (ICL) implicitly equivalent to Gradient Descent (GD)?\nSeveral recent works draw analogies between the dynamics of GD and the emergent\nbehavior of ICL in large language models. However, these works make assumptions\nfar from the realistic natural language setting in which language models are\ntrained. Such discrepancies between theory and practice, therefore, necessitate\nfurther investigation to validate their applicability.\n We start by highlighting the weaknesses in prior works that construct\nTransformer weights to simulate gradient descent. 
Their experiments with\ntraining Transformers on ICL objective, inconsistencies in the order\nsensitivity of ICL and GD, sparsity of the constructed weights, and sensitivity\nto parameter changes are some examples of a mismatch from the real-world\nsetting.\n Furthermore, we probe and compare the ICL vs. GD hypothesis in a natural\nsetting. We conduct comprehensive empirical analyses on language models\npretrained on natural data (LLaMa-7B). Our comparisons on various performance\nmetrics highlight the inconsistent behavior of ICL and GD as a function of\nvarious factors such as datasets, models, and number of demonstrations. We\nobserve that ICL and GD adapt the output distribution of language models\ndifferently. These results indicate that the equivalence between ICL and GD is\nan open hypothesis, requires nuanced considerations and calls for further\nstudies.\n","authors":["Lingfeng Shen","Aayush Mishra","Daniel Khashabi"],"pdf_url":"https://arxiv.org/pdf/2310.08540v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05173v2","updated":"2023-10-12T17:25:44Z","published":"2023-09-11T00:02:05Z","title":"DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning","summary":" Prompt tuning (PT), where a small amount of trainable soft (continuous)\nprompt vectors is affixed to the input of language models (LM), has shown\npromising results across various tasks and models for parameter-efficient\nfine-tuning (PEFT). PT stands out from other PEFT approaches because it\nmaintains competitive performance with fewer trainable parameters and does not\ndrastically scale up its parameters as the model size expands. However, PT\nintroduces additional soft prompt tokens, leading to longer input sequences,\nwhich significantly impacts training and inference time and memory usage due to\nthe Transformer's quadratic complexity. Particularly concerning for Large\nLanguage Models (LLMs) that face heavy daily querying. To address this issue,\nwe propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt\ninto a shorter soft prompt and a pair of low-rank matrices that are then\noptimised with two different learning rates. This allows DePT to achieve better\nperformance while saving over 20% memory and time costs compared to vanilla PT\nand its variants, without changing trainable parameter sizes. Through extensive\nexperiments on 23 natural language processing (NLP) and vision-language (VL)\ntasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches,\nincluding the full fine-tuning baseline in some scenarios. Additionally, we\nempirically show that DEPT grows more efficient as the model size increases.\nOur further study reveals that DePT integrates seamlessly with\nparameter-efficient transfer learning in the few-shot learning setting and\nhighlights its adaptability to various model architectures and sizes.\n","authors":["Zhengxiang Shi","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2309.05173v2.pdf","comment":"Code is available at https://github.com/ZhengxiangShi/DePT"},{"id":"http://arxiv.org/abs/2310.08535v1","updated":"2023-10-12T17:24:15Z","published":"2023-10-12T17:24:15Z","title":"Formally Specifying the High-Level Behavior of LLM-Based Agents","summary":" LLM-based agents have recently emerged as promising tools for solving\nchallenging problems without the need for task-specific finetuned models that\ncan be expensive to procure. 
Currently, the design and implementation of such\nagents is ad hoc, as the wide variety of tasks that LLM-based agents may be\napplied to naturally means there can be no one-size-fits-all approach to agent\ndesign. In this work we aim to alleviate the difficulty of designing and\nimplementing new agents by proposing a minimalistic, high-level generation\nframework that simplifies the process of building agents. The framework we\nintroduce allows the user to specify desired agent behaviors in Linear Temporal\nLogic (LTL). The declarative LTL specification is then used to construct a\nconstrained decoder that guarantees the LLM will produce an output exhibiting\nthe desired behavior. By designing our framework in this way, we obtain several\nbenefits, including the ability to enforce complex agent behavior, the ability\nto formally validate prompt examples, and the ability to seamlessly incorporate\ncontent-focused logical constraints into generation. In particular, our\ndeclarative approach, in which the desired behavior is simply described without\nconcern for how it should be implemented or enforced, enables rapid design,\nimplementation and experimentation with different LLM-based agents. We\ndemonstrate how the proposed framework can be used to implement recent\nLLM-based agents, and show how the guardrails our approach provides can lead to\nimprovements in agent performance. In addition, we release our code for general\nuse.\n","authors":["Maxwell Crouse","Ibrahim Abdelaziz","Kinjal Basu","Soham Dan","Sadhana Kumaravel","Achille Fokoue","Pavan Kapanipathi","Luis Lastras"],"pdf_url":"https://arxiv.org/pdf/2310.08535v1.pdf","comment":"Preprint under review"},{"id":"http://arxiv.org/abs/2302.07863v4","updated":"2023-10-12T17:23:56Z","published":"2023-02-15T18:55:29Z","title":"Speculative Decoding with Big Little Decoder","summary":" The recent emergence of Large Language Models based on the Transformer\narchitecture has enabled dramatic advancements in the field of Natural Language\nProcessing. However, these models have long inference latency, which limits\ntheir deployment and makes them prohibitively expensive for various real-time\napplications. The inference latency is further exacerbated by autoregressive\ngenerative tasks, as models need to run iteratively to generate tokens\nsequentially without leveraging token-level parallelization. To address this,\nwe propose Big Little Decoder (BiLD), a framework that can improve inference\nefficiency and latency for a wide range of text generation applications. The\nBiLD framework contains two models with different sizes that collaboratively\ngenerate text. The small model runs autoregressively to generate text with a\nlow inference cost, and the large model is only invoked occasionally to refine\nthe small model's inaccurate predictions in a non-autoregressive manner. To\ncoordinate the small and large models, BiLD introduces two simple yet effective\npolicies: (1) the fallback policy that determines when to hand control over to\nthe large model; and (2) the rollback policy that determines when the large\nmodel needs to correct the small model's inaccurate predictions. To evaluate\nour framework across different tasks and models, we apply BiLD to various text\ngeneration scenarios encompassing machine translation on IWSLT 2017 De-En and\nWMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4\nGPU, our framework achieves a speedup of up to 2.12x speedup with minimal\ngeneration quality degradation. 
Furthermore, our framework is fully\nplug-and-play and can be applied without any modifications in the training\nprocess or model architecture. Our code is open-sourced\n","authors":["Sehoon Kim","Karttikeya Mangalam","Suhong Moon","Jitendra Malik","Michael W. Mahoney","Amir Gholami","Kurt Keutzer"],"pdf_url":"https://arxiv.org/pdf/2302.07863v4.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08523v1","updated":"2023-10-12T17:17:27Z","published":"2023-10-12T17:17:27Z","title":"LLM-augmented Preference Learning from Natural Language","summary":" Finding preferences expressed in natural language is an important but\nchallenging task. State-of-the-art(SotA) methods leverage transformer-based\nmodels such as BERT, RoBERTa, etc. and graph neural architectures such as graph\nattention networks. Since Large Language Models (LLMs) are equipped to deal\nwith larger context lengths and have much larger model sizes than the\ntransformer-based model, we investigate their ability to classify comparative\ntext directly. This work aims to serve as a first step towards using LLMs for\nthe CPC task. We design and conduct a set of experiments that format the\nclassification task into an input prompt for the LLM and a methodology to get a\nfixed-format response that can be automatically evaluated. Comparing\nperformances with existing methods, we see that pre-trained LLMs are able to\noutperform the previous SotA models with no fine-tuning involved. Our results\nshow that the LLMs can consistently outperform the SotA when the target text is\nlarge -- i.e. composed of multiple sentences --, and are still comparable to\nthe SotA performance in shorter text. We also find that few-shot learning\nyields better performance than zero-shot learning.\n","authors":["Inwon Kang","Sikai Ruan","Tyler Ho","Jui-Chien Lin","Farhad Mohsin","Oshani Seneviratne","Lirong Xia"],"pdf_url":"https://arxiv.org/pdf/2310.08523v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.07580v2","updated":"2023-10-12T17:07:29Z","published":"2022-04-15T13:02:33Z","title":"mGPT: Few-Shot Learners Go Multilingual","summary":" Recent studies report that autoregressive language models can successfully\nsolve many NLP tasks via zero- and few-shot learning paradigms, which opens up\nnew possibilities for using the pre-trained language models. This paper\nintroduces two autoregressive GPT-like models with 1.3 billion and 13 billion\nparameters trained on 60 languages from 25 language families using Wikipedia\nand Colossal Clean Crawled Corpus. We reproduce the GPT-3 architecture using\nGPT-2 sources and the sparse attention mechanism; Deepspeed and Megatron\nframeworks allow us to parallelize the training and inference steps\neffectively. The resulting models show performance on par with the recently\nreleased XGLM models by Facebook, covering more languages and enhancing NLP\npossibilities for low resource languages of CIS countries and Russian small\nnations. We detail the motivation for the choices of the architecture design,\nthoroughly describe the data preparation pipeline, and train five small\nversions of the model to choose the most optimal multilingual tokenization\nstrategy. We measure the model perplexity in all covered languages and evaluate\nit on the wide spectre of multilingual tasks, including classification,\ngenerative, sequence labeling and knowledge probing. The models were evaluated\nwith the zero-shot and few-shot methods. 
Furthermore, we compared the\nclassification tasks with the state-of-the-art multilingual model XGLM. source\ncode and the mGPT XL model are publicly released.\n","authors":["Oleh Shliazhko","Alena Fenogenova","Maria Tikhonova","Vladislav Mikhailov","Anastasia Kozlova","Tatiana Shavrina"],"pdf_url":"https://arxiv.org/pdf/2204.07580v2.pdf","comment":"Accepted for publication at Transactions of the Association for\n Computational Linguistics (TACL) To be presented at the Conference on\n Empirical Methods in Natural Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2310.08511v1","updated":"2023-10-12T17:06:19Z","published":"2023-10-12T17:06:19Z","title":"HoneyBee: Progressive Instruction Finetuning of Large Language Models\n for Materials Science","summary":" We propose an instruction-based process for trustworthy data curation in\nmaterials science (MatSci-Instruct), which we then apply to finetune a\nLLaMa-based language model targeted for materials science (HoneyBee).\nMatSci-Instruct helps alleviate the scarcity of relevant, high-quality\nmaterials science textual data available in the open literature, and HoneyBee\nis the first billion-parameter language model specialized to materials science.\nIn MatSci-Instruct we improve the trustworthiness of generated data by\nprompting multiple commercially available large language models for generation\nwith an Instructor module (e.g. Chat-GPT) and verification from an independent\nVerifier module (e.g. Claude). Using MatSci-Instruct, we construct a dataset of\nmultiple tasks and measure the quality of our dataset along multiple\ndimensions, including accuracy against known facts, relevance to materials\nscience, as well as completeness and reasonableness of the data. Moreover, we\niteratively generate more targeted instructions and instruction-data in a\nfinetuning-evaluation-feedback loop leading to progressively better performance\nfor our finetuned HoneyBee models. Our evaluation on the MatSci-NLP benchmark\nshows HoneyBee's outperformance of existing language models on materials\nscience tasks and iterative improvement in successive stages of\ninstruction-data refinement. We study the quality of HoneyBee's language\nmodeling through automatic evaluation and analyze case studies to further\nunderstand the model's capabilities and limitations. Our code and relevant\ndatasets are publicly available at\n\\url{https://github.com/BangLab-UdeM-Mila/NLP4MatSci-HoneyBee}.\n","authors":["Yu Song","Santiago Miret","Huan Zhang","Bang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08496v1","updated":"2023-10-12T16:55:44Z","published":"2023-10-12T16:55:44Z","title":"The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and\n POS","summary":" Automatic analysis for modern Chinese has greatly improved the accuracy of\ntext mining in related fields, but the study of ancient Chinese is still\nrelatively rare. Ancient text division and lexical annotation are important\nparts of classical literature comprehension, and previous studies have tried to\nconstruct auxiliary dictionary and other fused knowledge to improve the\nperformance. In this paper, we propose a framework for ancient Chinese Word\nSegmentation and Part-of-Speech Tagging that makes a twofold effort: on the one\nhand, we try to capture the wordhood semantics; on the other hand, we\nre-predict the uncertain samples of baseline model by introducing external\nknowledge. 
The performance of our architecture outperforms pre-trained BERT\nwith CRF and existing tools such as Jiayan.\n","authors":["Pengyu Wang","Zhichen Ren"],"pdf_url":"https://arxiv.org/pdf/2310.08496v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08491v1","updated":"2023-10-12T16:50:08Z","published":"2023-10-12T16:50:08Z","title":"Prometheus: Inducing Fine-grained Evaluation Capability in Language\n Models","summary":" Recently, using a powerful proprietary Large Language Model (LLM) (e.g.,\nGPT-4) as an evaluator for long-form responses has become the de facto\nstandard. However, for practitioners with large-scale evaluation tasks and\ncustom criteria in consideration (e.g., child-readability), using proprietary\nLLMs as an evaluator is unreliable due to the closed-source nature,\nuncontrolled versioning, and prohibitive costs. In this work, we propose\nPrometheus, a fully open-source LLM that is on par with GPT-4's evaluation\ncapabilities when the appropriate reference materials (reference answer, score\nrubric) are accompanied. We first construct the Feedback Collection, a new\ndataset that consists of 1K fine-grained score rubrics, 20K instructions, and\n100K responses and language feedback generated by GPT-4. Using the Feedback\nCollection, we train Prometheus, a 13B evaluator LLM that can assess any given\nlong-form text based on customized score rubric provided by the user.\nExperimental results show that Prometheus scores a Pearson correlation of 0.897\nwith human evaluators when evaluating with 45 customized score rubrics, which\nis on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392).\nFurthermore, measuring correlation with GPT-4 with 1222 customized score\nrubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask\nEval) shows similar trends, bolstering Prometheus's capability as an evaluator\nLLM. Lastly, Prometheus achieves the highest accuracy on two human preference\nbenchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced\nreward models explicitly trained on human preference datasets, highlighting its\npotential as an universal reward model. We open-source our code, dataset, and\nmodel at https://github.com/kaistAI/Prometheus.\n","authors":["Seungone Kim","Jamin Shin","Yejin Cho","Joel Jang","Shayne Longpre","Hwaran Lee","Sangdoo Yun","Seongjin Shin","Sungdong Kim","James Thorne","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2310.08491v1.pdf","comment":"Work in Progress"},{"id":"http://arxiv.org/abs/2310.08487v1","updated":"2023-10-12T16:46:58Z","published":"2023-10-12T16:46:58Z","title":"GraphextQA: A Benchmark for Evaluating Graph-Enhanced Large Language\n Models","summary":" While multi-modal models have successfully integrated information from image,\nvideo, and audio modalities, integrating graph modality into large language\nmodels (LLMs) remains unexplored. This discrepancy largely stems from the\ninherent divergence between structured graph data and unstructured text data.\nIncorporating graph knowledge provides a reliable source of information,\nenabling potential solutions to address issues in text generation, e.g.,\nhallucination, and lack of domain knowledge. To evaluate the integration of\ngraph knowledge into language models, a dedicated dataset is needed. However,\nthere is currently no benchmark dataset specifically designed for multimodal\ngraph-language models. 
To address this gap, we propose GraphextQA, a question\nanswering dataset with paired subgraphs, retrieved from Wikidata, to facilitate\nthe evaluation and future development of graph-language models. Additionally,\nwe introduce a baseline model called CrossGNN, which conditions answer\ngeneration on the paired graphs by cross-attending question-aware graph\nfeatures at decoding. The proposed dataset is designed to evaluate\ngraph-language models' ability to understand graphs and make use of it for\nanswer generation. We perform experiments with language-only models and the\nproposed graph-language model to validate the usefulness of the paired graphs\nand to demonstrate the difficulty of the task.\n","authors":["Yuanchun Shen","Ruotong Liao","Zhen Han","Yunpu Ma","Volker Tresp"],"pdf_url":"https://arxiv.org/pdf/2310.08487v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08483v1","updated":"2023-10-12T16:42:53Z","published":"2023-10-12T16:42:53Z","title":"Understanding the Humans Behind Online Misinformation: An Observational\n Study Through the Lens of the COVID-19 Pandemic","summary":" The proliferation of online misinformation has emerged as one of the biggest\nthreats to society. Considerable efforts have focused on building\nmisinformation detection models, still the perils of misinformation remain\nabound. Mitigating online misinformation and its ramifications requires a\nholistic approach that encompasses not only an understanding of its intricate\nlandscape in relation to the complex issue and topic-rich information ecosystem\nonline, but also the psychological drivers of individuals behind it. Adopting a\ntime series analytic technique and robust causal inference-based design, we\nconduct a large-scale observational study analyzing over 32 million COVID-19\ntweets and 16 million historical timeline tweets. We focus on understanding the\nbehavior and psychology of users disseminating misinformation during COVID-19\nand its relationship with the historical inclinations towards sharing\nmisinformation on Non-COVID topics before the pandemic. Our analysis\nunderscores the intricacies inherent to cross-topic misinformation, and\nhighlights that users' historical inclination toward sharing misinformation is\npositively associated with their present behavior pertaining to misinformation\nsharing on emergent topics and beyond. This work may serve as a valuable\nfoundation for designing user-centric inoculation strategies and\necologically-grounded agile interventions for effectively tackling online\nmisinformation.\n","authors":["Mohit Chandra","Anush Mattapalli","Munmun De Choudhury"],"pdf_url":"https://arxiv.org/pdf/2310.08483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08475v1","updated":"2023-10-12T16:32:44Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. 
Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights\\footnote{Code\nand dataset are available in https://github.com/zjunlp/EasyEdit.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08461v1","updated":"2023-10-12T16:21:04Z","published":"2023-10-12T16:21:04Z","title":"DistillSpec: Improving Speculative Decoding via Knowledge Distillation","summary":" Speculative decoding (SD) accelerates large language model inference by\nemploying a faster draft model for generating multiple tokens, which are then\nverified in parallel by the larger target model, resulting in the text\ngenerated according to the target model distribution. However, identifying a\ncompact draft model that is well-aligned with the target model is challenging.\nTo tackle this issue, we propose DistillSpec that uses knowledge distillation\nto better align the draft model with the target model, before applying SD.\nDistillSpec makes two key design choices, which we demonstrate via systematic\nstudy to be crucial to improving the draft and target alignment: utilizing\non-policy data generation from the draft model, and tailoring the divergence\nfunction to the task and decoding strategy. Notably, DistillSpec yields\nimpressive 10 - 45% speedups over standard SD on a range of standard\nbenchmarks, using both greedy and non-greedy sampling. Furthermore, we combine\nDistillSpec with lossy SD to achieve fine-grained control over the latency vs.\ntask performance trade-off. Finally, in practical scenarios with models of\nvarying sizes, first using distillation to boost the performance of the target\nmodel and then applying DistillSpec to train a well-aligned draft model can\nreduce decoding latency by 6-10x with minimal performance drop, compared to\nstandard decoding without distillation.\n","authors":["Yongchao Zhou","Kaifeng Lyu","Ankit Singh Rawat","Aditya Krishna Menon","Afshin Rostamizadeh","Sanjiv Kumar","Jean-François Kagy","Rishabh Agarwal"],"pdf_url":"https://arxiv.org/pdf/2310.08461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.15363v3","updated":"2023-10-12T16:12:57Z","published":"2022-11-28T14:38:45Z","title":"On the Security Vulnerabilities of Text-to-SQL Models","summary":" Although it has been demonstrated that Natural Language Processing (NLP)\nalgorithms are vulnerable to deliberate attacks, the question of whether such\nweaknesses can lead to software security threats is under-explored. To bridge\nthis gap, we conducted vulnerability tests on Text-to-SQL systems that are\ncommonly used to create natural language interfaces to databases. We showed\nthat the Text-to-SQL modules within six commercial applications can be\nmanipulated to produce malicious code, potentially leading to data breaches and\nDenial of Service attacks. This is the first demonstration that NLP models can\nbe exploited as attack vectors in the wild. In addition, experiments using four\nopen-source language models verified that straightforward backdoor attacks on\nText-to-SQL systems achieve a 100% success rate without affecting their\nperformance. 
The aim of this work is to draw the community's attention to\npotential software security issues associated with NLP algorithms and encourage\nexploration of methods to mitigate against them.\n","authors":["Xutan Peng","Yipeng Zhang","Jingfeng Yang","Mark Stevenson"],"pdf_url":"https://arxiv.org/pdf/2211.15363v3.pdf","comment":"ISSRE 2023: Best Paper Candidate"},{"id":"http://arxiv.org/abs/2305.14259v3","updated":"2023-10-12T16:10:51Z","published":"2023-05-23T17:12:08Z","title":"Learning to Generate Novel Scientific Directions with Contextualized\n Literature-based Discovery","summary":" Literature-Based Discovery (LBD) aims to discover new scientific knowledge by\nmining papers and generating hypotheses. Standard LBD is limited to predicting\npairwise relations between discrete concepts (e.g., drug-disease links), and\nignores critical contexts like experimental settings (e.g., a specific patient\npopulation where a drug is evaluated) and background motivations (e.g., to find\ndrugs without specific side effects). We address these limitations with a novel\nformulation of contextualized-LBD (C-LBD): generating scientific hypotheses in\nnatural language, while grounding them in a context that controls the\nhypothesis search space. We present a modeling framework using retrieval of\n``inspirations'' from past scientific papers. Our evaluations reveal that GPT-4\ntends to generate ideas with overall low technical depth and novelty, while our\ninspiration prompting approaches partially mitigate this issue. Our work\nrepresents a first step toward building language models that generate new ideas\nderived from scientific literature.\n","authors":["Qingyun Wang","Doug Downey","Heng Ji","Tom Hope"],"pdf_url":"https://arxiv.org/pdf/2305.14259v3.pdf","comment":"24 pages. Code and resource is available at\n https://github.com/EagleW/CLBD"},{"id":"http://arxiv.org/abs/2307.07864v2","updated":"2023-10-12T16:06:19Z","published":"2023-07-15T18:25:56Z","title":"CIDER: Context sensitive sentiment analysis for short-form text","summary":" Researchers commonly perform sentiment analysis on large collections of short\ntexts like tweets, Reddit posts or newspaper headlines that are all focused on\na specific topic, theme or event. Usually, general purpose sentiment analysis\nmethods are used which perform well on average but miss the variation in\nmeaning that happens across different contexts, for example, the word \"active\"\nhas a very different intention and valence in the phrase \"active lifestyle\"\nversus \"active volcano\". This work presents a new approach, CIDER (Context\nInformed Dictionary and sEntiment Reasoner), which performs context sensitive\nsentiment analysis, where the valence of sentiment laden terms is inferred from\nthe whole corpus before being used to score the individual texts. In this paper\nwe detail the CIDER algorithm and demonstrate that it outperforms\nstate-of-the-art generalist sentiment analysis on a large collection of tweets\nabout the weather. We have made our implementation of CIDER available as a\npython package: https://pypi.org/project/ciderpolarity/.\n","authors":["James C. Young","Rudy Arthur","Hywel T. P. Williams"],"pdf_url":"https://arxiv.org/pdf/2307.07864v2.pdf","comment":"12 pages, 2 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.04472v2","updated":"2023-10-12T15:57:04Z","published":"2023-10-06T04:48:48Z","title":"Effective Slogan Generation with Noise Perturbation","summary":" Slogans play a crucial role in building the brand's identity of the firm. 
A\nslogan is expected to reflect the firm's vision and the brand's value propositions in\nmemorable and likeable ways. Automating the generation of slogans with such\ncharacteristics is challenging. Previous studies developed and tested slogan\ngeneration with syntactic control and summarization models, which are not\ncapable of generating distinctive slogans. We introduce a novel approach that\nleverages a pre-trained transformer T5 model with noise perturbation on a newly\nproposed 1:N matching pair dataset. This approach serves as a contributing\nfactor in generating distinctive and coherent slogans. Furthermore, the proposed\napproach incorporates descriptions about the firm and brand into the generation\nof slogans. We evaluate generated slogans based on ROUGE1, ROUGEL and Cosine\nSimilarity metrics and also assess them with human subjects in terms of\nslogan's distinctiveness, coherence, and fluency. The results demonstrate that\nour approach yields better performance than baseline models and other\ntransformer-based models.\n","authors":["Jongeun Kim","MinChung Kim","Taehwan Kim"],"pdf_url":"https://arxiv.org/pdf/2310.04472v2.pdf","comment":"Accepted in CIKM 2023 short paper\n https://github.com/joannekim0420/SloganGeneration"},{"id":"http://arxiv.org/abs/2310.08433v1","updated":"2023-10-12T15:56:24Z","published":"2023-10-12T15:56:24Z","title":"A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative\n Writing","summary":" We evaluate a range of recent LLMs on English creative writing, a challenging\nand complex task that requires imagination, coherence, and style. We use a\ndifficult, open-ended scenario chosen to avoid training data reuse: an epic\nnarration of a single combat between Ignatius J. Reilly, the protagonist of the\nPulitzer Prize-winning novel A Confederacy of Dunces (1980), and a pterodactyl,\na prehistoric flying reptile. We ask several LLMs and humans to write such a\nstory and conduct a human evaluation involving various criteria such as fluency,\ncoherence, originality, humor, and style. Our results show that some\nstate-of-the-art commercial LLMs match or slightly outperform our writers in\nmost dimensions, whereas open-source LLMs lag behind. Humans retain an edge in\ncreativity, while humor shows a binary divide between LLMs that can handle it\ncomparably to humans and those that fail at it. We discuss the implications and\nlimitations of our study and suggest directions for future research.\n","authors":["Carlos Gómez-Rodríguez","Paul Williams"],"pdf_url":"https://arxiv.org/pdf/2310.08433v1.pdf","comment":"Accepted for publication in Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2302.11713v4","updated":"2023-10-12T15:30:41Z","published":"2023-02-23T00:33:54Z","title":"Can Pre-trained Vision and Language Models Answer Visual\n Information-Seeking Questions?","summary":" Pre-trained vision and language models have demonstrated state-of-the-art\ncapabilities over existing tasks involving images and texts, including visual\nquestion answering. However, it remains unclear whether these models possess\nthe capability to answer questions that are not only querying visual content\nbut knowledge-intensive and information-seeking. In this study, we introduce\nInfoSeek, a visual question answering dataset tailored for information-seeking\nquestions that cannot be answered with only common sense knowledge. Using\nInfoSeek, we analyze various pre-trained visual question answering models and\ngain insights into their characteristics. 
Our findings reveal that\nstate-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.)\nface challenges in answering visual information-seeking questions, but\nfine-tuning on the InfoSeek dataset elicits models to use fine-grained\nknowledge that was learned during their pre-training. Furthermore, we show that\naccurate visual entity recognition can be used to improve performance on\nInfoSeek by retrieving relevant documents, showing a significant space for\nimprovement.\n","authors":["Yang Chen","Hexiang Hu","Yi Luan","Haitian Sun","Soravit Changpinyo","Alan Ritter","Ming-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2302.11713v4.pdf","comment":"EMNLP 2023 (main conference); Our dataset and evaluation is available\n at https://open-vision-language.github.io/infoseek/"},{"id":"http://arxiv.org/abs/2310.08395v1","updated":"2023-10-12T15:08:14Z","published":"2023-10-12T15:08:14Z","title":"Prompting Large Language Models with Chain-of-Thought for Few-Shot\n Knowledge Base Question Generation","summary":" The task of Question Generation over Knowledge Bases (KBQG) aims to convert a\nlogical form into a natural language question. Due to the expensive cost\nof large-scale question annotation, the methods of KBQG under low-resource\nscenarios urgently need to be developed. However, current methods heavily rely\non annotated data for fine-tuning, which is not well-suited for few-shot\nquestion generation. The emergence of Large Language Models (LLMs) has shown\ntheir impressive generalization ability in few-shot tasks. Inspired by\nChain-of-Thought (CoT) prompting, which is an in-context learning strategy for\nreasoning, we formulate the KBQG task as a reasoning problem, where the generation\nof a complete question is split into a series of sub-question generation steps.\nOur proposed prompting method KQG-CoT first retrieves supportive logical forms\nfrom the unlabeled data pool taking account of the characteristics of the\nlogical form. Then, we write a prompt to make explicit the reasoning chain of\ngenerating complicated questions based on the selected demonstrations. To\nfurther ensure prompt quality, we extend KQG-CoT into KQG-CoT+ via sorting the\nlogical forms by their complexity. We conduct extensive experiments over three\npublic KBQG datasets. The results demonstrate that our prompting method\nconsistently outperforms other prompting baselines on the evaluated datasets.\nRemarkably, our KQG-CoT+ method could surpass existing few-shot SoTA results on\nthe PathQuestions dataset by 18.25, 10.72, and 10.18 absolute points on BLEU-4,\nMETEOR, and ROUGE-L, respectively.\n","authors":["Yuanyuan Liang","Jianing Wang","Hanlun Zhu","Lei Wang","Weining Qian","Yunshi Lan"],"pdf_url":"https://arxiv.org/pdf/2310.08395v1.pdf","comment":"Accepted by EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2310.08394v1","updated":"2023-10-12T15:07:11Z","published":"2023-10-12T15:07:11Z","title":"Towards Better Evaluation of Instruction-Following: A Case-Study in\n Summarization","summary":" Despite recent advances, evaluating how well large language models (LLMs)\nfollow user instructions remains an open problem. While evaluation methods of\nlanguage models have seen a rise in prompt-based approaches, limited work on\nthe correctness of these methods has been conducted. In this work, we perform a\nmeta-evaluation of a variety of metrics to quantify how accurately they measure\nthe instruction-following abilities of LLMs. 
Our investigation is performed on\ngrounded query-based summarization by collecting a new short-form, real-world\ndataset riSum, containing $300$ document-instruction pairs with $3$ answers\neach. All $900$ answers are rated by $3$ human annotators. Using riSum, we\nanalyze agreement between evaluation methods and human judgment. Finally, we\npropose new LLM-based reference-free evaluation methods that improve upon\nestablished baselines and perform on-par with costly reference-based metrics\nwhich require high-quality summaries.\n","authors":["Ondrej Skopek","Rahul Aralikatte","Sian Gooding","Victor Carbune"],"pdf_url":"https://arxiv.org/pdf/2310.08394v1.pdf","comment":"Accepted to CoNLL 2023"},{"id":"http://arxiv.org/abs/2310.08383v1","updated":"2023-10-12T14:57:24Z","published":"2023-10-12T14:57:24Z","title":"Reconstructing Materials Tetrahedron: Challenges in Materials\n Information Extraction","summary":" Discovery of new materials has a documented history of propelling human\nprogress for centuries and more. The behaviour of a material is a function of\nits composition, structure, and properties, which further depend on its\nprocessing and testing conditions. Recent developments in deep learning and\nnatural language processing have enabled information extraction at scale from\npublished literature such as peer-reviewed publications, books, and patents.\nHowever, this information is spread in multiple formats, such as tables, text,\nand images, and with little or no uniformity in reporting style giving rise to\nseveral machine learning challenges. Here, we discuss, quantify, and document\nthese outstanding challenges in automated information extraction (IE) from\nmaterials science literature towards the creation of a large materials science\nknowledge base. Specifically, we focus on IE from text and tables and outline\nseveral challenges with examples. We hope the present work inspires researchers\nto address the challenges in a coherent fashion, providing to fillip to IE for\nthe materials knowledge base.\n","authors":["Kausik Hira","Mohd Zaki","Dhruvil Sheth"," Mausam","N M Anoop Krishnan"],"pdf_url":"https://arxiv.org/pdf/2310.08383v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08372v1","updated":"2023-10-12T14:44:05Z","published":"2023-10-12T14:44:05Z","title":"Improving Factual Consistency for Knowledge-Grounded Dialogue Systems\n via Knowledge Enhancement and Alignment","summary":" Pretrained language models (PLMs) based knowledge-grounded dialogue systems\nare prone to generate responses that are factually inconsistent with the\nprovided knowledge source. In such inconsistent responses, the dialogue models\nfail to accurately express the external knowledge they rely upon. Inspired by\nprevious work which identified that feed-forward networks (FFNs) within\nTransformers are responsible for factual knowledge expressions, we investigate\ntwo methods to efficiently improve the factual expression capability {of FFNs}\nby knowledge enhancement and alignment respectively. We first propose\n\\textsc{K-Dial}, which {explicitly} introduces {extended FFNs in Transformers\nto enhance factual knowledge expressions} given the specific patterns of\nknowledge-grounded dialogue inputs. Additionally, we apply the reinforcement\nlearning for factual consistency (RLFC) method to implicitly adjust FFNs'\nexpressions in responses by aligning with gold knowledge for the factual\nconsistency preference. 
To comprehensively assess the factual consistency and\ndialogue quality of responses, we employ extensive automatic measures and human\nevaluations including sophisticated fine-grained NLI-based metrics.\nExperimental results on WoW and CMU\\_DoG datasets demonstrate that our methods\nefficiently enhance the ability of the FFN module to convey factual knowledge,\nvalidating the efficacy of improving factual consistency for knowledge-grounded\ndialogue systems.\n","authors":["Boyang Xue","Weichao Wang","Hongru Wang","Fei Mi","Rui Wang","Yasheng Wang","Lifeng Shang","Xin Jiang","Qun Liu","Kam-Fai Wong"],"pdf_url":"https://arxiv.org/pdf/2310.08372v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08367v1","updated":"2023-10-12T14:38:25Z","published":"2023-10-12T14:38:25Z","title":"MCU: A Task-centric Framework for Open-ended Agent Evaluation in\n Minecraft","summary":" To pursue the goal of creating an open-ended agent in Minecraft, an\nopen-ended game environment with unlimited possibilities, this paper introduces\na task-centric framework named MCU for Minecraft agent evaluation. The MCU\nframework leverages the concept of atom tasks as fundamental building blocks,\nenabling the generation of diverse or even arbitrary tasks. Within the MCU\nframework, each task is measured with six distinct difficulty scores (time\nconsumption, operational effort, planning complexity, intricacy, creativity,\nnovelty). These scores offer a multi-dimensional assessment of a task from\ndifferent angles, and thus can reveal an agent's capability on specific facets.\nThe difficulty scores also serve as the feature of each task, which creates a\nmeaningful task space and unveils the relationship between tasks. For efficient\nevaluation of Minecraft agents employing the MCU framework, we maintain a\nunified benchmark, namely SkillForge, which comprises representative tasks with\ndiverse categories and difficulty distribution. We also provide convenient\nfilters for users to select tasks to assess specific capabilities of agents. We\nshow that MCU has the high expressivity to cover all tasks used in recent\nliterature on Minecraft agent, and underscores the need for advancements in\nareas such as creativity, precise control, and out-of-distribution\ngeneralization under the goal of open-ended Minecraft agent development.\n","authors":["Haowei Lin","Zihao Wang","Jianzhu Ma","Yitao Liang"],"pdf_url":"https://arxiv.org/pdf/2310.08367v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03128v2","updated":"2023-10-12T14:37:55Z","published":"2023-10-04T19:39:26Z","title":"MetaTool Benchmark for Large Language Models: Deciding Whether to Use\n Tools and Which to Use","summary":" Large language models (LLMs) have garnered significant attention due to their\nimpressive natural language processing (NLP) capabilities. Recently, many\nstudies have focused on the tool utilization ability of LLMs. They primarily\ninvestigated how LLMs effectively collaborate with given specific tools.\nHowever, in scenarios where LLMs serve as intelligent agents, as seen in\napplications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate\ndecision-making processes that involve deciding whether to employ a tool and\nselecting the most suitable tool(s) from a collection of available tools to\nfulfill user requests. Therefore, in this paper, we introduce MetaTool, a\nbenchmark designed to evaluate whether LLMs have tool usage awareness and can\ncorrectly choose tools. 
Specifically, we create a dataset called ToolE within\nthe benchmark. This dataset contains various types of user queries in the form\nof prompts that trigger LLMs to use tools, including both single-tool and\nmulti-tool scenarios. Subsequently, we set the tasks for both tool usage\nawareness and tool selection. We define four subtasks from different\nperspectives in tool selection, including tool selection with similar choices,\ntool selection in specific scenarios, tool selection with possible reliability\nissues, and multi-tool selection. We conduct experiments involving nine popular\nLLMs and find that the majority of them still struggle to effectively select\ntools, highlighting the existing gaps between LLMs and genuine intelligent\nagents. However, through the error analysis, we found there is still\nsignificant room for improvement. Finally, we conclude with insights for tool\ndevelopers that follow ChatGPT to provide detailed descriptions that can\nenhance the tool selection performance of LLMs.\n","authors":["Yue Huang","Jiawen Shi","Yuan Li","Chenrui Fan","Siyuan Wu","Qihui Zhang","Yixin Liu","Pan Zhou","Yao Wan","Neil Zhenqiang Gong","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2310.03128v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08365v1","updated":"2023-10-12T14:36:13Z","published":"2023-10-12T14:36:13Z","title":"From Large Language Models to Knowledge Graphs for Biomarker Discovery\n in Cancer","summary":" Domain experts often rely on up-to-date knowledge for apprehending and\ndisseminating specific biological processes that help them design strategies to\ndevelop prevention and therapeutic decision-making. A challenging scenario for\nartificial intelligence (AI) is using biomedical data (e.g., texts, imaging,\nomics, and clinical) to provide diagnosis and treatment recommendations for\ncancerous conditions. Data and knowledge about cancer, drugs, genes, proteins,\nand their mechanism is spread across structured (knowledge bases (KBs)) and\nunstructured (e.g., scientific articles) sources. A large-scale knowledge graph\n(KG) can be constructed by integrating these data, followed by extracting facts\nabout semantically interrelated entities and relations. Such KGs not only allow\nexploration and question answering (QA) but also allow domain experts to deduce\nnew knowledge. However, exploring and querying large-scale KGs is tedious for\nnon-domain users due to a lack of understanding of the underlying data assets\nand semantic technologies. In this paper, we develop a domain KG to leverage\ncancer-specific biomarker discovery and interactive QA. For this, a domain\nontology called OncoNet Ontology (ONO) is developed to enable semantic\nreasoning for validating gene-disease relations. The KG is then enriched by\nharmonizing the ONO, controlled vocabularies, and additional biomedical\nconcepts from scientific articles by employing BioBERT- and SciBERT-based\ninformation extraction (IE) methods. Further, since the biomedical domain is\nevolving, where new findings often replace old ones, without employing\nup-to-date findings, there is a high chance an AI system exhibits concept drift\nwhile providing diagnosis and treatment. Therefore, we finetuned the KG using\nlarge language models (LLMs) based on more recent articles and KBs that might\nnot have been seen by the named entity recognition models.\n","authors":["Md. 
Rezaul Karim","Lina Molinas Comet","Md Shajalal","Oya Beyan","Dietrich Rebholz-Schuhmann","Stefan Decker"],"pdf_url":"https://arxiv.org/pdf/2310.08365v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2302.04737"},{"id":"http://arxiv.org/abs/2304.11657v2","updated":"2023-10-12T13:57:58Z","published":"2023-04-23T13:54:39Z","title":"Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in\n Large Language Models","summary":" Large language models (LLMs) can achieve highly effective performance on\nvarious reasoning tasks by incorporating step-by-step chain-of-thought (CoT)\nprompting as demonstrations. However, the reasoning chains of demonstrations\ngenerated by LLMs are prone to errors, which can subsequently lead to incorrect\nreasoning during inference. Furthermore, inappropriate exemplars (overly\nsimplistic or complex), can affect overall performance among varying levels of\ndifficulty. We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts\nPrompting), an iterative bootstrapping approach for selecting exemplars and\ngenerating reasoning chains. By utilizing iterative bootstrapping, our approach\nenables LLMs to autonomously rectify errors, resulting in more precise and\ncomprehensive reasoning chains. Simultaneously, our approach selects\nchallenging yet answerable questions accompanied by reasoning chains as\nexemplars with a moderate level of difficulty, which enhances the LLMs'\ngeneralizability across varying levels of difficulty. Experimental results\nindicate that Iter-CoT exhibits superiority, achieving competitive performance\nacross three distinct reasoning tasks on ten datasets.\n","authors":["Jiashuo Sun","Yi Luo","Yeyun Gong","Chen Lin","Yelong Shen","Jian Guo","Nan Duan"],"pdf_url":"https://arxiv.org/pdf/2304.11657v2.pdf","comment":"28 pages, 10 figures, 21 tables"},{"id":"http://arxiv.org/abs/2309.16396v3","updated":"2023-10-12T13:39:39Z","published":"2023-09-28T12:43:32Z","title":"A Comprehensive Survey of Document-level Relation Extraction (2016-2023)","summary":" Document-level relation extraction (DocRE) is an active area of research in\nnatural language processing (NLP) concerned with identifying and extracting\nrelationships between entities beyond sentence boundaries. Compared to the more\ntraditional sentence-level relation extraction, DocRE provides a broader\ncontext for analysis and is more challenging because it involves identifying\nrelationships that may span multiple sentences or paragraphs. This task has\ngained increased interest as a viable solution to build and populate knowledge\nbases automatically from unstructured large-scale documents (e.g., scientific\npapers, legal contracts, or news articles), in order to have a better\nunderstanding of relationships between entities. This paper aims to provide a\ncomprehensive overview of recent advances in this field, highlighting its\ndifferent applications in comparison to sentence-level relation extraction.\n","authors":["Julien Delaunay","Hanh Thi Hong Tran","Carlos-Emiliano González-Gallardo","Georgeta Bordea","Nicolas Sidere","Antoine Doucet"],"pdf_url":"https://arxiv.org/pdf/2309.16396v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08320v1","updated":"2023-10-12T13:33:04Z","published":"2023-10-12T13:33:04Z","title":"Defending Our Privacy With Backdoors","summary":" The proliferation of large AI models trained on uncurated, often sensitive\nweb-scraped data has raised significant privacy concerns. 
One of the concerns\nis that adversaries can extract information about the training data using\nprivacy attacks. Unfortunately, the task of removing specific information from\nthe models without sacrificing performance is not straightforward and has\nproven to be challenging. We propose a rather easy yet effective defense based\non backdoor attacks to remove private information such as names of individuals\nfrom models, and focus in this work on text encoders. Specifically, through\nstrategic insertion of backdoors, we align the embeddings of sensitive phrases\nwith those of neutral terms-\"a person\" instead of the person's name. Our\nempirical results demonstrate the effectiveness of our backdoor-based defense\non CLIP by assessing its performance using a specialized privacy attack for\nzero-shot classifiers. Our approach provides not only a new \"dual-use\"\nperspective on backdoor attacks, but also presents a promising avenue to\nenhance the privacy of individuals within models trained on uncurated\nweb-scraped data.\n","authors":["Dominik Hintersdorf","Lukas Struppek","Daniel Neider","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2310.08320v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.08309v1","updated":"2023-10-12T13:15:11Z","published":"2023-10-12T13:15:11Z","title":"Not All Demonstration Examples are Equally Beneficial: Reweighting\n Demonstration Examples for In-Context Learning","summary":" Large Language Models (LLMs) have recently gained the In-Context Learning\n(ICL) ability with the models scaling up, allowing them to quickly adapt to\ndownstream tasks with only a few demonstration examples prepended in the input\nsequence. Nonetheless, the current practice of ICL treats all demonstration\nexamples equally, which still warrants improvement, as the quality of examples\nis usually uneven. In this paper, we investigate how to determine approximately\noptimal weights for demonstration examples and how to apply them during ICL. To\nassess the quality of weights in the absence of additional validation data, we\ndesign a masked self-prediction (MSP) score that exhibits a strong correlation\nwith the final ICL performance. To expedite the weight-searching process, we\ndiscretize the continuous weight space and adopt beam search. With\napproximately optimal weights obtained, we further propose two strategies to\napply them to demonstrations at different model positions. Experimental results\non 8 text classification tasks show that our approach outperforms conventional\nICL by a large margin. Our code are publicly available at\nhttps:github.com/Zhe-Young/WICL.\n","authors":["Zhe Yang","Damai Dai","Peiyi Wang","Zhifang Sui"],"pdf_url":"https://arxiv.org/pdf/2310.08309v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08298v1","updated":"2023-10-12T13:02:34Z","published":"2023-10-12T13:02:34Z","title":"MProto: Multi-Prototype Network with Denoised Optimal Transport for\n Distantly Supervised Named Entity Recognition","summary":" Distantly supervised named entity recognition (DS-NER) aims to locate entity\nmentions and classify their types with only knowledge bases or gazetteers and\nunlabeled corpus. However, distant annotations are noisy and degrade the\nperformance of NER models. In this paper, we propose a noise-robust prototype\nnetwork named MProto for the DS-NER task. 
Different from previous\nprototype-based NER methods, MProto represents each entity type with multiple\nprototypes to characterize the intra-class variance among entity\nrepresentations. To optimize the classifier, each token should be assigned an\nappropriate ground-truth prototype, and we consider such token-prototype\nassignment as an optimal transport (OT) problem. Furthermore, to mitigate the\nnoise from incomplete labeling, we propose a novel denoised optimal transport\n(DOT) algorithm. Specifically, we utilize the assignment result between Other\nclass tokens and all prototypes to distinguish unlabeled entity tokens from\ntrue negatives. Experiments on several DS-NER benchmarks demonstrate that our\nMProto achieves state-of-the-art performance. The source code is now available\non GitHub.\n","authors":["Shuhui Wu","Yongliang Shen","Zeqi Tan","Wenqi Ren","Jietian Guo","Shiliang Pu","Weiming Lu"],"pdf_url":"https://arxiv.org/pdf/2310.08298v1.pdf","comment":"Accepted to EMNLP-2023, camera ready version"},{"id":"http://arxiv.org/abs/2310.08291v1","updated":"2023-10-12T12:52:46Z","published":"2023-10-12T12:52:46Z","title":"Expanding the Vocabulary of BERT for Knowledge Base Construction","summary":" Knowledge base construction entails acquiring structured information to\ncreate a knowledge base of factual and relational data, facilitating question\nanswering, information retrieval, and semantic understanding. The challenge\ncalled \"Knowledge Base Construction from Pretrained Language Models\" at\nInternational Semantic Web Conference 2023 defines tasks focused on\nconstructing a knowledge base using a language model. Our focus was on Track 1 of\nthe challenge, where the parameters are constrained to a maximum of 1 billion,\nand the inclusion of entity descriptions within the prompt is prohibited.\n Although the masked language model offers sufficient flexibility to extend\nits vocabulary, it is not inherently designed for multi-token prediction. To\naddress this, we present Vocabulary Expandable BERT for knowledge base\nconstruction, which expands the language model's vocabulary while preserving\nsemantic embeddings for newly added words. We adopt task-specific\nre-pre-training on the masked language model to further enhance the language model.\n Through experimentation, the results show the effectiveness of our\napproaches. Our framework achieves an F1 score of 0.323 on the hidden test set and\n0.362 on the validation set; both datasets are provided by the challenge.\nNotably, our framework adopts a lightweight language model (BERT-base, 0.13\nbillion parameters) and surpasses the model using prompts directly on large\nlanguage model (Chatgpt-3, 175 billion parameters). Moreover, Token-Recode\nachieves performance comparable to Re-pretrain. This research advances\nlanguage understanding models by enabling the direct embedding of multi-token\nentities, signifying a substantial step forward for the link prediction task in\nknowledge graphs and metadata completion in data management.\n","authors":["Dong Yang","Xu Wang","Remzi Celebi"],"pdf_url":"https://arxiv.org/pdf/2310.08291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08280v1","updated":"2023-10-12T12:36:46Z","published":"2023-10-12T12:36:46Z","title":"Optimizing Odia Braille Literacy: The Influence of Speed on Error\n Reduction and Enhanced Comprehension","summary":" This study aims to conduct an extensive, detailed analysis of Odia Braille\nreading comprehension among students with visual disabilities. 
Specifically, the\nstudy explores their reading speed and hand or finger movements. The study also\naims to investigate any comprehension difficulties and reading errors they may\nencounter. Six students from the 9th and 10th grades, aged between 14 and 16,\nparticipated in the study. We observed participants' hand movements to\nunderstand how reading errors were connected to hand movement and identify the\nstudents' reading difficulties. We also evaluated the participants' Odia Braille\nreading skills, including their reading speed (in words per minute), errors,\nand comprehension. The average speed of an Odia Braille reader is 17.64 wpm.\nAccording to the study, there was a noticeable correlation between reading\nspeed and reading errors. As reading speed decreased, the number of reading\nerrors tended to increase. Moreover, the study established a link between\nreduced Braille reading errors and improved reading comprehension. In contrast,\nthe study found that better comprehension was associated with increased reading\nspeed. The researchers concluded with some interesting findings about preferred\nBraille reading patterns. These findings have important theoretical,\ndevelopmental, and methodological implications for instruction.\n","authors":["Monnie Parida","Manjira Sinha","Anupam Basu","Pabitra Mitra"],"pdf_url":"https://arxiv.org/pdf/2310.08280v1.pdf","comment":"4 Pages, Paper accepted in Diversity and Inclusion track at\n CODS-COMAD 2024"},{"id":"http://arxiv.org/abs/2310.08279v1","updated":"2023-10-12T12:31:23Z","published":"2023-10-12T12:31:23Z","title":"CP-KGC: Constrained-Prompt Knowledge Graph Completion with Large\n Language Models","summary":" Knowledge graph completion (KGC) aims to utilize existing knowledge to deduce\nand infer missing connections within knowledge graphs. Text-based approaches,\nlike SimKGC, have outperformed graph embedding methods, showcasing the promise\nof inductive KGC. However, the efficacy of text-based methods hinges on the\nquality of entity textual descriptions. In this paper, we identify the key\nissue of whether large language models (LLMs) can generate effective text. To\nmitigate hallucination in LLM-generated text, we introduce a\nconstraint-based prompt that utilizes the entity and its textual description as\ncontextual constraints to enhance data quality. Our Constrained-Prompt\nKnowledge Graph Completion (CP-KGC) method demonstrates effective inference\nunder low-resource computing conditions and surpasses prior results on the\nWN18RR and FB15K237 datasets. This showcases the integration of LLMs in KGC\ntasks and provides new directions for future research.\n","authors":["Rui Yang","Li Fang","Yi Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.08279v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.12053v3","updated":"2023-10-12T12:06:37Z","published":"2023-09-21T13:20:13Z","title":"AceGPT, Localizing Large Language Models in Arabic","summary":" This paper is devoted to the development of a localized Large Language Model\n(LLM) specifically for Arabic, a language imbued with unique cultural\ncharacteristics inadequately addressed by current mainstream models.\nSignificant concerns emerge when addressing cultural sensitivity and local\nvalues. 
To address this, the paper proposes a comprehensive solution that\nincludes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT)\nutilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside\nReinforcement Learning with AI Feedback (RLAIF) employing a reward model\nattuned to local culture and values. The goal is to cultivate culturally\ncognizant and value-aligned Arabic LLMs capable of accommodating the diverse,\napplication-specific needs of Arabic-speaking communities. Comprehensive\nevaluations reveal that the resulting model, dubbed 'AceGPT', sets the\nstate-of-the-art standard for open Arabic LLMs across various benchmarks,\nincluding the instruction-following benchmark (i.e., Arabic Vicuna-80 and\nArabic AlpacaEval), knowledge benchmark (i.e., Arabic MMLU and EXAMs), and the\nnewly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT\noutperforms Turbo in the popular Vicuna-80 benchmark when evaluated with GPT-4,\ndespite the benchmark's limited scale. Codes, data, and models are in\nhttps://github.com/FreedomIntelligence/AceGPT.\n","authors":["Huang Huang","Fei Yu","Jianqing Zhu","Xuening Sun","Hao Cheng","Dingjie Song","Zhihong Chen","Abdulmohsen Alharthi","Bang An","Ziche Liu","Zhiyi Zhang","Junying Chen","Jianquan Li","Benyou Wang","Lian Zhang","Ruoyu Sun","Xiang Wan","Haizhou Li","Jinchao Xu"],"pdf_url":"https://arxiv.org/pdf/2309.12053v3.pdf","comment":"https://github.com/FreedomIntelligence/AceGPT"},{"id":"http://arxiv.org/abs/2310.08256v1","updated":"2023-10-12T12:01:32Z","published":"2023-10-12T12:01:32Z","title":"Impact of Co-occurrence on Factual Knowledge of Large Language Models","summary":" Large language models (LLMs) often make factually incorrect responses despite\ntheir success in various applications. In this paper, we hypothesize that\nrelying heavily on simple co-occurrence statistics of the pre-training corpora\nis one of the main factors that cause factual errors. Our results reveal that\nLLMs are vulnerable to the co-occurrence bias, defined as preferring frequently\nco-occurred words over the correct answer. Consequently, LLMs struggle to\nrecall facts whose subject and object rarely co-occur in the pre-training\ndataset although they are seen during finetuning. We show that co-occurrence\nbias remains despite scaling up model sizes or finetuning. Therefore, we\nsuggest finetuning on a debiased dataset to mitigate the bias by filtering out\nbiased samples whose subject-object co-occurrence count is high. Although\ndebiased finetuning allows LLMs to memorize rare facts in the training set, it\nis not effective in recalling rare facts unseen during finetuning. Further\nresearch in mitigation will help build reliable language models by preventing\npotential errors. The code is available at\n\\url{https://github.com/CheongWoong/impact_of_cooccurrence}.\n","authors":["Cheongwoong Kang","Jaesik Choi"],"pdf_url":"https://arxiv.org/pdf/2310.08256v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2309.02092v3","updated":"2023-10-12T11:46:23Z","published":"2023-09-05T09:56:29Z","title":"Where are We in Event-centric Emotion Analysis? Bridging Emotion Role\n Labeling and Appraisal-based Approaches","summary":" The term emotion analysis in text subsumes various natural language\nprocessing tasks which have in common the goal to enable computers to\nunderstand emotions. Most popular is emotion classification in which one or\nmultiple emotions are assigned to a predefined textual unit. 
While such a setting\nis appropriate for identifying the reader's or author's emotion, emotion role\nlabeling adds the perspective of mentioned entities and extracts text spans\nthat correspond to the emotion cause. The underlying emotion theories agree on\none important point: that an emotion is caused by some internal or external\nevent and comprises several subcomponents, including the subjective feeling and\na cognitive evaluation. We therefore argue that emotions and events are related\nin two ways. (1) Emotions are events; this perspective is the foundation in\nnatural language processing for emotion role labeling. (2) Emotions are caused\nby events; a perspective that is made explicit with research on how to incorporate\npsychological appraisal theories in NLP models to interpret events. These two\nresearch directions, role labeling and (event-focused) emotion classification,\nhave by and large been tackled separately. In this paper, we contextualize both\nperspectives and discuss open research questions.\n","authors":["Roman Klinger"],"pdf_url":"https://arxiv.org/pdf/2309.02092v3.pdf","comment":"accepted to the Big Picture Workshop\n (https://bigpictureworkshop.com/)"},{"id":"http://arxiv.org/abs/2305.15020v2","updated":"2023-10-12T11:45:56Z","published":"2023-05-24T11:00:33Z","title":"An Efficient Multilingual Language Model Compression through Vocabulary\n Trimming","summary":" Multilingual language models (LMs) have become a powerful tool in NLP,\nespecially for non-English languages. Nevertheless, model parameters of\nmultilingual LMs remain large due to the larger embedding matrix of the\nvocabulary covering tokens in different languages. In contrast, monolingual\nLMs can be trained in a target language with the language-specific vocabulary\nonly, but this requires a large budget and the availability of reliable corpora to\nachieve a high-quality LM from scratch. In this paper, we propose\nvocabulary-trimming (VT), a method to reduce a multilingual LM vocabulary to a\ntarget language by deleting irrelevant tokens from its vocabulary. In theory,\nVT can compress any existing multilingual LM to build monolingual LMs in any\nlanguage covered by the multilingual LM. In our experiments, we show that VT\ncan retain the original performance of the multilingual LM, while being smaller\nin size (in general around 50% of the original vocabulary size is enough) than\nthe original multilingual LM. The evaluation is performed over four NLP tasks\n(two generative and two classification tasks) among four widely used\nmultilingual LMs in seven languages. Finally, we show that this methodology can\nkeep the best of both monolingual and multilingual worlds by keeping a small\nsize, as monolingual models do, without the need for specifically retraining them,\nand even limiting potentially harmful social biases.\n","authors":["Asahi Ushio","Yi Zhou","Jose Camacho-Collados"],"pdf_url":"https://arxiv.org/pdf/2305.15020v2.pdf","comment":"EMNLP 2023 findings"},{"id":"http://arxiv.org/abs/2304.00457v3","updated":"2023-10-12T11:35:35Z","published":"2023-04-02T05:47:09Z","title":"LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language\n Models","summary":" Large Language Models (LLMs) have revolutionized natural language processing\nand demonstrated impressive capabilities in various tasks. Unfortunately, they\nare prone to hallucinations, where the model exposes incorrect or false\ninformation in its responses, which renders diligent evaluation approaches\nmandatory. 
While LLM performance in specific knowledge fields is often\nevaluated based on question and answer (Q&A) datasets, such evaluations usually\nreport only a single accuracy number for the dataset, which often covers an\nentire field. This field-based evaluation, is problematic with respect to\ntransparency and model improvement. A stratified evaluation could instead\nreveal subfields, where hallucinations are more likely to occur and thus help\nto better assess LLMs' risks and guide their further development. To support\nsuch stratified evaluations, we propose LLMMaps as a novel visualization\ntechnique that enables users to evaluate LLMs' performance with respect to Q&A\ndatasets. LLMMaps provide detailed insights into LLMs' knowledge capabilities\nin different subfields, by transforming Q&A datasets as well as LLM responses\ninto an internal knowledge structure. An extension for comparative\nvisualization furthermore, allows for the detailed comparison of multiple LLMs.\nTo assess LLMMaps we use them to conduct a comparative analysis of several\nstate-of-the-art LLMs, such as BLOOM, GPT-2, GPT-3, ChatGPT and LLaMa-13B, as\nwell as two qualitative user evaluations. All necessary source code and data\nfor generating LLMMaps to be used in scientific publications and elsewhere is\navailable on GitHub: https://github.com/viscom-ulm/LLMMaps\n","authors":["Patrik Puchert","Poonam Poonam","Christian van Onzenoodt","Timo Ropinski"],"pdf_url":"https://arxiv.org/pdf/2304.00457v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08240v1","updated":"2023-10-12T11:35:24Z","published":"2023-10-12T11:35:24Z","title":"Who Said That? Benchmarking Social Media AI Detection","summary":" AI-generated text has proliferated across various online platforms, offering\nboth transformative prospects and posing significant risks related to\nmisinformation and manipulation. Addressing these challenges, this paper\nintroduces SAID (Social media AI Detection), a novel benchmark developed to\nassess AI-text detection models' capabilities in real social media platforms.\nIt incorporates real AI-generate text from popular social media platforms like\nZhihu and Quora. Unlike existing benchmarks, SAID deals with content that\nreflects the sophisticated strategies employed by real AI users on the Internet\nwhich may evade detection or gain visibility, providing a more realistic and\nchallenging evaluation landscape. A notable finding of our study, based on the\nZhihu dataset, reveals that annotators can distinguish between AI-generated and\nhuman-generated texts with an average accuracy rate of 96.5%. This finding\nnecessitates a re-evaluation of human capability in recognizing AI-generated\ntext in today's widely AI-influenced environment. Furthermore, we present a new\nuser-oriented AI-text detection challenge focusing on the practicality and\neffectiveness of identifying AI-generated text based on user information and\nmultiple responses. The experimental results demonstrate that conducting\ndetection tasks on actual social media platforms proves to be more challenging\ncompared to traditional simulated AI-text detection, resulting in a decreased\naccuracy. 
On the other hand, user-oriented AI-generated text detection\nsignificantly improve the accuracy of detection.\n","authors":["Wanyun Cui","Linqiu Zhang","Qianle Wang","Shuyang Cai"],"pdf_url":"https://arxiv.org/pdf/2310.08240v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08232v1","updated":"2023-10-12T11:25:46Z","published":"2023-10-12T11:25:46Z","title":"Language Models are Universal Embedders","summary":" In the large language model (LLM) revolution, embedding is a key component of\nvarious systems. For example, it is used to retrieve knowledge or memories for\nLLMs, to build content moderation filters, etc. As such cases span from English\nto other natural or programming languages, from retrieval to classification and\nbeyond, it is desirable to build a unified embedding model rather than\ndedicated ones for each scenario. In this work, we make an initial step towards\nthis goal, demonstrating that multiple languages (both natural and programming)\npre-trained transformer decoders can embed universally when finetuned on\nlimited English data. We provide a comprehensive practice with thorough\nevaluations. On English MTEB, our models achieve competitive performance on\ndifferent embedding tasks by minimal training data. On other benchmarks, such\nas multilingual classification and code search, our models (without any\nsupervision) perform comparably to, or even surpass heavily supervised\nbaselines and/or APIs. These results provide evidence of a promising path\ntowards building powerful unified embedders that can be applied across tasks\nand languages.\n","authors":["Xin Zhang","Zehan Li","Yanzhao Zhang","Dingkun Long","Pengjun Xie","Meishan Zhang","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08232v1.pdf","comment":"13 pages, in progress"},{"id":"http://arxiv.org/abs/2310.08225v1","updated":"2023-10-12T11:17:40Z","published":"2023-10-12T11:17:40Z","title":"Fast Word Error Rate Estimation Using Self-Supervised Representations\n For Speech And Text","summary":" The quality of automatic speech recognition (ASR) is typically measured by\nword error rate (WER). WER estimation is a task aiming to predict the WER of an\nASR system, given a speech utterance and a transcription. This task has gained\nincreasing attention while advanced ASR systems are trained on large amounts of\ndata. In this case, WER estimation becomes necessary in many scenarios, for\nexample, selecting training data with unknown transcription quality or\nestimating the testing performance of an ASR system without ground truth\ntranscriptions. Facing large amounts of data, the computation efficiency of a\nWER estimator becomes essential in practical applications. However, previous\nworks usually did not consider it as a priority. In this paper, a Fast WER\nestimator (Fe-WER) using self-supervised learning representation (SSLR) is\nintroduced. The estimator is built upon SSLR aggregated by average pooling. The\nresults show that Fe-WER outperformed the e-WER3 baseline relatively by 19.69%\nand 7.16% on Ted-Lium3 in both evaluation metrics of root mean square error and\nPearson correlation coefficient, respectively. Moreover, the estimation\nweighted by duration was 10.43% when the target was 10.88%. 
Lastly, the\ninference speed was about 4x in terms of a real-time factor.\n","authors":["Chanho Park","Chengsong Lu","Mingjie Chen","Thomas Hain"],"pdf_url":"https://arxiv.org/pdf/2310.08225v1.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2310.08221v1","updated":"2023-10-12T11:11:54Z","published":"2023-10-12T11:11:54Z","title":"SimCKP: Simple Contrastive Learning of Keyphrase Representations","summary":" Keyphrase generation (KG) aims to generate a set of summarizing words or\nphrases given a source document, while keyphrase extraction (KE) aims to\nidentify them from the text. Because the search space is much smaller in KE, it\nis often combined with KG to predict keyphrases that may or may not exist in\nthe corresponding document. However, current unified approaches adopt sequence\nlabeling and maximization-based generation that primarily operate at a token\nlevel, falling short in observing and scoring keyphrases as a whole. In this\nwork, we propose SimCKP, a simple contrastive learning framework that consists\nof two stages: 1) An extractor-generator that extracts keyphrases by learning\ncontext-aware phrase-level representations in a contrastive manner while also\ngenerating keyphrases that do not appear in the document; 2) A reranker that\nadapts scores for each generated phrase by likewise aligning their\nrepresentations with the corresponding document. Experimental results on\nmultiple benchmark datasets demonstrate the effectiveness of our proposed\napproach, which outperforms the state-of-the-art models by a significant\nmargin.\n","authors":["Minseok Choi","Chaeheon Gwak","Seho Kim","Si Hyeong Kim","Jaegul Choo"],"pdf_url":"https://arxiv.org/pdf/2310.08221v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2309.16292v2","updated":"2023-10-12T11:11:47Z","published":"2023-09-28T09:41:35Z","title":"DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large\n Language Models","summary":" Recent advancements in autonomous driving have relied on data-driven\napproaches, which are widely adopted but face challenges including dataset\nbias, overfitting, and uninterpretability. Drawing inspiration from the\nknowledge-driven nature of human driving, we explore the question of how to\ninstill similar capabilities into autonomous driving systems and summarize a\nparadigm that integrates an interactive environment, a driver agent, as well as\na memory component to address this question. Leveraging large language models\nwith emergent abilities, we propose the DiLu framework, which combines a\nReasoning and a Reflection module to enable the system to perform\ndecision-making based on common-sense knowledge and evolve continuously.\nExtensive experiments prove DiLu's capability to accumulate experience and\ndemonstrate a significant advantage in generalization ability over\nreinforcement learning-based methods. Moreover, DiLu is able to directly\nacquire experiences from real-world datasets which highlights its potential to\nbe deployed on practical autonomous driving systems. 
To the best of our\nknowledge, we are the first to instill knowledge-driven capability into\nautonomous driving systems from the perspective of how humans drive.\n","authors":["Licheng Wen","Daocheng Fu","Xin Li","Xinyu Cai","Tao Ma","Pinlong Cai","Min Dou","Botian Shi","Liang He","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2309.16292v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08187v1","updated":"2023-10-12T10:26:26Z","published":"2023-10-12T10:26:26Z","title":"Visual Question Generation in Bengali","summary":" The task of Visual Question Generation (VQG) is to generate human-like\nquestions relevant to the given image. As VQG is an emerging research field,\nexisting works tend to focus only on resource-rich languages such as English due\nto the availability of datasets. In this paper, we propose the first Bengali\nVisual Question Generation task and develop a novel transformer-based\nencoder-decoder architecture that generates questions in Bengali when given an\nimage. We propose multiple variants of models - (i) image-only: baseline model\nof generating questions from images without additional information, (ii)\nimage-category and image-answer-category: guided VQG where we condition the\nmodel to generate questions based on the answer and the category of expected\nquestion. These models are trained and evaluated on the translated VQAv2.0\ndataset. Our quantitative and qualitative results establish the first\nstate-of-the-art models for the VQG task in Bengali and demonstrate that our models\nare capable of generating grammatically correct and relevant questions. Our\nquantitative results show that our image-cat model achieves a BLEU-1 score of\n33.12 and a BLEU-3 score of 7.56, the highest among the three variants.\nWe also perform a human evaluation to assess the quality of the generation\ntasks. Human evaluation suggests that the image-cat model is capable of generating\ngoal-driven and attribute-specific questions and also stays relevant to the\ncorresponding image.\n","authors":["Mahmud Hasan","Labiba Islam","Jannatul Ferdous Ruma","Tasmiah Tahsin Mayeesha","Rashedur M. Rahman"],"pdf_url":"https://arxiv.org/pdf/2310.08187v1.pdf","comment":"19 pages including references, 4 figures and 3 tables. Accepted in\n the Proceedings of the Workshop on Multimodal, Multilingual Natural Language\n Generation and Multilingual WebNLG Challenge (MM-NLG 2023)"},{"id":"http://arxiv.org/abs/2310.08185v1","updated":"2023-10-12T10:21:37Z","published":"2023-10-12T10:21:37Z","title":"EIPE-text: Evaluation-Guided Iterative Plan Extraction for Long-Form\n Narrative Text Generation","summary":" Plan-and-Write is a common hierarchical approach in long-form narrative text\ngeneration, which first creates a plan to guide the narrative writing.\nFollowing this approach, several studies rely on simply prompting large\nlanguage models for planning, which often yields suboptimal results. In this\npaper, we propose a new framework called Evaluation-guided Iterative Plan\nExtraction for long-form narrative text generation (EIPE-text), which extracts\nplans from the corpus of narratives and utilizes the extracted plans to\nconstruct a better planner. EIPE-text has three stages: plan extraction,\nlearning, and inference. In the plan extraction stage, it iteratively extracts\nand improves plans from the narrative corpus and constructs a plan corpus. 
We\npropose a question answer (QA) based evaluation mechanism to automatically\nevaluate the plans and generate detailed plan refinement instructions to guide\nthe iterative improvement. In the learning stage, we build a better planner by\nfine-tuning with the plan corpus or in-context learning with examples in the\nplan corpus. Finally, we leverage a hierarchical approach to generate long-form\nnarratives. We evaluate the effectiveness of EIPE-text in the domains of novels\nand storytelling. Both GPT-4-based evaluations and human evaluations\ndemonstrate that our method can generate more coherent and relevant long-form\nnarratives. Our code will be released in the future.\n","authors":["Wang You","Wenshan Wu","Yaobo Liang","Shaoguang Mao","Chenfei Wu","Maosong Cao","Yuzhe Cai","Yiduo Guo","Yan Xia","Furu Wei","Nan Duan"],"pdf_url":"https://arxiv.org/pdf/2310.08185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08172v1","updated":"2023-10-12T09:55:45Z","published":"2023-10-12T09:55:45Z","title":"Exploring the Cognitive Knowledge Structure of Large Language Models: An\n Educational Diagnostic Assessment Approach","summary":" Large Language Models (LLMs) have not only exhibited exceptional performance\nacross various tasks, but also demonstrated sparks of intelligence. Recent\nstudies have focused on assessing their capabilities on human exams and\nrevealed their impressive competence in different domains. However, cognitive\nresearch on the overall knowledge structure of LLMs is still lacking. In this\npaper, based on educational diagnostic assessment method, we conduct an\nevaluation using MoocRadar, a meticulously annotated human test dataset based\non Bloom Taxonomy. We aim to reveal the knowledge structures of LLMs and gain\ninsights of their cognitive capabilities. This research emphasizes the\nsignificance of investigating LLMs' knowledge and understanding the disparate\ncognitive patterns of LLMs. By shedding light on models' knowledge, researchers\ncan advance development and utilization of LLMs in a more informed and\neffective manner.\n","authors":["Zheyuan Zhang","Jifan Yu","Juanzi Li","Lei Hou"],"pdf_url":"https://arxiv.org/pdf/2310.08172v1.pdf","comment":"Findings of EMNLP 2023 (Short Paper)"},{"id":"http://arxiv.org/abs/2310.08170v1","updated":"2023-10-12T09:49:10Z","published":"2023-10-12T09:49:10Z","title":"Simplicity Level Estimate (SLE): A Learned Reference-Less Metric for\n Sentence Simplification","summary":" Automatic evaluation for sentence simplification remains a challenging\nproblem. Most popular evaluation metrics require multiple high-quality\nreferences -- something not readily available for simplification -- which makes\nit difficult to test performance on unseen domains. Furthermore, most existing\nmetrics conflate simplicity with correlated attributes such as fluency or\nmeaning preservation. 
We propose a new learned evaluation metric (SLE) which\nfocuses on simplicity, outperforming almost all existing metrics in terms of\ncorrelation with human judgements.\n","authors":["Liam Cripwell","Joël Legrand","Claire Gardent"],"pdf_url":"https://arxiv.org/pdf/2310.08170v1.pdf","comment":"Accepted to EMNLP 2023 (Main Conference)"},{"id":"http://arxiv.org/abs/2310.08167v1","updated":"2023-10-12T09:41:22Z","published":"2023-10-12T09:41:22Z","title":"Multiclass Classification of Policy Documents with Large Language Models","summary":" Classifying policy documents into policy issue topics has been a long-time\neffort in political science and communication disciplines. Efforts to automate\ntext classification processes for social science research purposes have so far\nachieved remarkable results, but there is still much room for progress. In\nthis work, we test the prediction performance of an alternative strategy, which\nrequires much less human involvement than full manual coding. We use OpenAI's GPT\n3.5 and GPT 4 models, which are pre-trained instruction-tuned\nLarge Language Models (LLMs), to classify congressional bills and congressional\nhearings into the Comparative Agendas Project's 21 major policy issue topics. We\npropose three use-case scenarios and estimate overall accuracies ranging from\n58% to 83% depending on the scenario and GPT model employed. The three scenarios aim\nat minimal, moderate, and major human interference, respectively. Overall, our\nresults point towards the insufficiency of complete reliance on GPT with\nminimal human intervention, an increasing accuracy along with the human effort\nexerted, and a surprisingly high accuracy achieved in the most humanly\ndemanding use-case. However, the superior use-case achieved 83% accuracy on\nthe 65% of the data on which the two models agreed, suggesting that a similar\napproach to ours can be relatively easily implemented and allow for mostly\nautomated coding of a majority of a given dataset. This could free up resources,\nallowing manual human coding of the remaining 35% of the data to achieve an\noverall higher level of accuracy while reducing costs significantly.\n","authors":["Erkan Gunes","Christoffer Koch Florczak"],"pdf_url":"https://arxiv.org/pdf/2310.08167v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08166v1","updated":"2023-10-12T09:39:17Z","published":"2023-10-12T09:39:17Z","title":"Ziya-VL: Bilingual Large Vision-Language Model via Multi-Task\n Instruction Tuning","summary":" Recent advancements enlarge the capabilities of large language models (LLMs)\nin zero-shot image-to-text generation and understanding by integrating\nmulti-modal inputs. However, such success is typically limited to English\nscenarios due to the lack of large-scale and high-quality non-English\nmulti-modal resources, making it extremely difficult to establish competitive\ncounterparts in other languages. In this paper, we introduce the Ziya-VL\nseries, a set of bilingual large-scale vision-language models (LVLMs) designed\nto incorporate visual semantics into LLMs for multi-modal dialogue. Composed of\nZiya-VL-Base and Ziya-VL-Chat, our models adopt the Querying Transformer from\nBLIP-2, further exploring the assistance of optimization schemes such as\ninstruction tuning, multi-stage training and a low-rank adaptation module for\nvisual-language alignment. 
In addition, we stimulate the understanding ability\nof GPT-4 in multi-modal scenarios, translating our gathered English image-text\ndatasets into Chinese and generating instruction-response pairs through\nin-context learning. The experimental results demonstrate that compared to\nthe existing LVLMs, Ziya-VL achieves competitive performance across a wide\nrange of English-only tasks including zero-shot image-text retrieval, image\ncaptioning, and visual question answering. The evaluation leaderboard accessed\nby GPT-4 also indicates that our models possess satisfactory image-text\nunderstanding and generation capabilities in Chinese multi-modal scenario\ndialogues. Code, demo and models are available at\n~\\url{https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1}.\n","authors":["Junyu Lu","Dixiang Zhang","Xiaojun Wu","Xinyu Gao","Ruyi Gan","Jiaxing Zhang","Yan Song","Pingjian Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08166v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08152v1","updated":"2023-10-12T09:18:19Z","published":"2023-10-12T09:18:19Z","title":"Context Compression for Auto-regressive Transformers with Sentinel\n Tokens","summary":" The quadratic complexity of the attention module causes it to gradually become\nthe bulk of compute in Transformer-based LLMs during generation. Moreover, the\nexcessive key-value cache that arises when dealing with long inputs also brings\nsevere issues on memory footprint and inference latency. In this work, we\npropose a plug-and-play approach that is able to incrementally compress the\nintermediate activation of a specified span of tokens into compact ones,\nthereby reducing both memory and computational cost when processing subsequent\ncontext. Experiments on both in-domain language modeling and zero-shot\nopen-ended document generation demonstrate the advantage of our approach over\nsparse attention baselines in terms of fluency, n-gram matching, and semantic\nsimilarity. Lastly, we comprehensively profile the benefit of context\ncompression on improving the system throughput. Code is available at\nhttps://github.com/DRSY/KV_Compression.\n","authors":["Siyu Ren","Qi Jia","Kenny Q. Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.08152v1.pdf","comment":"To appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.05199v2","updated":"2023-10-12T09:04:07Z","published":"2023-10-08T15:14:39Z","title":"Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning\n from Human Feedback","summary":" Reinforcement learning from human feedback serves as a crucial bridge,\naligning large language models with human and societal values. This alignment\nrequires a vast corpus of human feedback to learn a reward model, which is\nsubsequently used to finetune language models. However, we have identified that\nthe reward model often finds shortcuts to bypass its intended objectives,\nmisleadingly assuming that humans prefer longer responses. The emergence of\nlength bias often induces the model to favor longer outputs, yet it doesn't\nequate to an increase in helpful information within these outputs. In this\npaper, we propose an innovative solution, applying the Product-of-Experts (PoE)\ntechnique to separate reward modeling from the influence of sequence length. 
In\nour framework, the main expert concentrates on understanding human intents,\nwhile the biased expert targets the identification and capture of length bias.\nTo further enhance the learning of bias, we introduce perturbations into the\nbias-focused expert, disrupting the flow of semantic information. Experimental\nresults validate the effectiveness of our approach, indicating that language\nmodel performance is improved, irrespective of sequence length.\n","authors":["Wei Shen","Rui Zheng","Wenyu Zhan","Jun Zhao","Shihan Dou","Tao Gui","Qi Zhang","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.05199v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.11534v4","updated":"2023-10-12T08:50:19Z","published":"2023-08-21T06:51:56Z","title":"PlatoLM: Teaching LLMs via a Socratic Questioning User Simulator","summary":" The unparalleled performance of closed-sourced ChatGPT has sparked efforts\ntowards its democratization, with notable strides made by leveraging real user\nand ChatGPT conversations, as evidenced by Vicuna. However, due to challenges\nin gathering conversations involving human participation, current endeavors\nlike Baize and UltraChat aim to automatically generate conversational data.\nThey primarily rely on ChatGPT conducting roleplay to simulate human behaviors\nbased on instructions rather than genuine learning from humans, resulting in\nlimited scope, diminished diversity, and an absence of genuine multi-round\nconversational dynamics. To address the above issues, we target human questions\nextracted from genuine human-machine conversations as a learning goal and train\na user simulator called `Socratic' to produce a high-quality human-centric\nsynthetic conversation dataset. Subsequently, this dataset was used to train\nour assistant model, named `PlatoLM'. Experimentally, PlatoLM outpaces baseline\nmodels in both Vicuna-Bench and MT-Bench by pairwise comparison when\nconsidering equivalent training set sizes, and manual evaluation also shows\nthat our model is highly competitive. Impressively, when fine-tuned with the\nlatest LLaMA 2 model, PlatoLM achieves the SOTA performance among 7B models\n(including LLaMA-2-7B-chat and Vicuna-7B) in MT-Bench benchmark and in\nAlpaca-Eval benchmark, it ranks second among 7B models, even beating some\nlarger scale models (including LLaMA-2-13B-chat and GPT-3.5). Further in-depth\nanalysis demonstrates the scalability and transferability of our approach. The\ncode is available at https://github.com/FreedomIntelligence/PlatoLM.\n","authors":["Chuyi Kong","Yaxin Fan","Xiang Wan","Feng Jiang","Benyou Wang"],"pdf_url":"https://arxiv.org/pdf/2308.11534v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08132v1","updated":"2023-10-12T08:45:21Z","published":"2023-10-12T08:45:21Z","title":"On the Relevance of Phoneme Duration Variability of Synthesized Training\n Data for Automatic Speech Recognition","summary":" Synthetic data generated by text-to-speech (TTS) systems can be used to\nimprove automatic speech recognition (ASR) systems in low-resource or domain\nmismatch tasks. It has been shown that TTS-generated outputs still do not have\nthe same qualities as real data. In this work we focus on the temporal\nstructure of synthetic data and its relation to ASR training. By using a novel\noracle setup we show how much the degradation of synthetic data quality is\ninfluenced by duration modeling in non-autoregressive (NAR) TTS. 
To get\nreference phoneme durations we use two common alignment methods, a hidden\nMarkov Gaussian-mixture model (HMM-GMM) aligner and a neural connectionist\ntemporal classification (CTC) aligner. Using a simple algorithm based on random\nwalks we shift phoneme duration distributions of the TTS system closer to real\ndurations, resulting in an improvement of an ASR system using synthetic data in\na semi-supervised setting.\n","authors":["Nick Rossenbach","Benedikt Hilmes","Ralf Schlüter"],"pdf_url":"https://arxiv.org/pdf/2310.08132v1.pdf","comment":"To appear at ASRU 2023"},{"id":"http://arxiv.org/abs/2310.08130v1","updated":"2023-10-12T08:38:12Z","published":"2023-10-12T08:38:12Z","title":"Fine-grained Conversational Decoding via Isotropic and Proximal Search","summary":" General-purpose text decoding approaches are usually adopted for dialogue\nresponse generation. Although the quality of the generated responses can be\nimproved with dialogue-specific encoding methods, conversational decoding\nmethods are still under-explored. Inspired by \\citet{wu2023learning} that a\ngood dialogue feature space should follow the rules of locality and isotropy,\nwe present a fine-grained conversational decoding method, termed\n\\textit{isotropic and proximal search (IPS)}. Our method is designed to\ngenerate the semantic-concentrated response, while still maintaining\ninformativeness and discrimination against the context. Experiments show that\nour approach outperforms existing decoding strategies in the dialogue field\nacross both automatic and human evaluation metrics. More in-depth analyses\nfurther confirm the effectiveness of our approach.\n","authors":["Yuxuan Yao","Han Wu","Qiling Xu","Linqi Song"],"pdf_url":"https://arxiv.org/pdf/2310.08130v1.pdf","comment":"To appear in EMNLP 2024"},{"id":"http://arxiv.org/abs/2310.08123v1","updated":"2023-10-12T08:24:15Z","published":"2023-10-12T08:24:15Z","title":"Who Wrote it and Why? Prompting Large-Language Models for Authorship\n Verification","summary":" Authorship verification (AV) is a fundamental task in natural language\nprocessing (NLP) and computational linguistics, with applications in forensic\nanalysis, plagiarism detection, and identification of deceptive content.\nExisting AV techniques, including traditional stylometric and deep learning\napproaches, face limitations in terms of data requirements and lack of\nexplainability. To address these limitations, this paper proposes PromptAV, a\nnovel technique that leverages Large-Language Models (LLMs) for AV by providing\nstep-by-step stylometric explanation prompts. PromptAV outperforms\nstate-of-the-art baselines, operates effectively with limited training data,\nand enhances interpretability through intuitive explanations, showcasing its\npotential as an effective and interpretable solution for the AV task.\n","authors":["Chia-Yu Hung","Zhiqiang Hu","Yujia Hu","Roy Ka-Wei Lee"],"pdf_url":"https://arxiv.org/pdf/2310.08123v1.pdf","comment":"7 pages,1 figure"},{"id":"http://arxiv.org/abs/2310.08104v1","updated":"2023-10-12T08:00:25Z","published":"2023-10-12T08:00:25Z","title":"Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and\n Textually Described Voices","summary":" Voice conversion aims to convert source speech into a target voice using\nrecordings of the target speaker as a reference. Newer models are producing\nincreasingly realistic output. But what happens when models are fed with\nnon-standard data, such as speech from a user with a speech impairment? 
We\ninvestigate how a recent voice conversion model performs on non-standard\ndownstream voice conversion tasks. We use a simple but robust approach called\nk-nearest neighbors voice conversion (kNN-VC). We look at four non-standard\napplications: stuttered voice conversion, cross-lingual voice conversion,\nmusical instrument conversion, and text-to-voice conversion. The latter\ninvolves converting to a target voice specified through a text description,\ne.g. \"a young man with a high-pitched voice\". Compared to an established\nbaseline, we find that kNN-VC retains high performance in stuttered and\ncross-lingual voice conversion. Results are more mixed for the musical\ninstrument and text-to-voice conversion tasks. E.g., kNN-VC works well on some\ninstruments like drums but not on others. Nevertheless, this shows that voice\nconversion models - and kNN-VC in particular - are increasingly applicable in a\nrange of non-standard downstream tasks. But there are still limitations when\nsamples are very far from the training distribution. Code, samples, trained\nmodels: https://rf5.github.io/sacair2023-knnvc-demo/.\n","authors":["Matthew Baas","Herman Kamper"],"pdf_url":"https://arxiv.org/pdf/2310.08104v1.pdf","comment":"11 pages, 1 figure, 5 tables. Accepted at SACAIR 2023"},{"id":"http://arxiv.org/abs/2310.01917v2","updated":"2023-10-12T07:59:56Z","published":"2023-10-03T09:46:02Z","title":"Hierarchical Evaluation Framework: Best Practices for Human Evaluation","summary":" Human evaluation plays a crucial role in Natural Language Processing (NLP) as\nit assesses the quality and relevance of developed systems, thereby\nfacilitating their enhancement. However, the absence of widely accepted human\nevaluation metrics in NLP hampers fair comparisons among different systems and\nthe establishment of universal assessment standards. Through an extensive\nanalysis of existing literature on human evaluation metrics, we identified\nseveral gaps in NLP evaluation methodologies. These gaps served as motivation\nfor developing our own hierarchical evaluation framework. The proposed\nframework offers notable advantages, particularly in providing a more\ncomprehensive representation of the NLP system's performance. We applied this\nframework to evaluate the developed Machine Reading Comprehension system, which\nwas utilized within a human-AI symbiosis model. The results highlighted the\nassociations between the quality of inputs and outputs, underscoring the\nnecessity to evaluate both components rather than solely focusing on outputs.\nIn future work, we will investigate the potential time-saving benefits of our\nproposed framework for evaluators assessing NLP systems.\n","authors":["Iva Bojic","Jessica Chen","Si Yuan Chang","Qi Chwen Ong","Shafiq Joty","Josip Car"],"pdf_url":"https://arxiv.org/pdf/2310.01917v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07282v2","updated":"2023-10-12T07:53:53Z","published":"2023-10-11T08:16:35Z","title":"An Analysis on Large Language Models in Healthcare: A Case Study of\n BioBERT","summary":" This paper conducts a comprehensive investigation into applying large\nlanguage models, particularly on BioBERT, in healthcare. It begins with\nthoroughly examining previous natural language processing (NLP) approaches in\nhealthcare, shedding light on the limitations and challenges these methods\nface. 
Following that, this research explores the path that led to the\nincorporation of BioBERT into healthcare applications, highlighting its\nsuitability for addressing the specific requirements of tasks related to\nbiomedical text mining. The analysis outlines a systematic methodology for\nfine-tuning BioBERT to meet the unique needs of the healthcare domain. This\napproach includes various components, including the gathering of data from a\nwide range of healthcare sources, data annotation for tasks like identifying\nmedical entities and categorizing them, and the application of specialized\npreprocessing techniques tailored to handle the complexities found in\nbiomedical texts. Additionally, the paper covers aspects related to model\nevaluation, with a focus on healthcare benchmarks and tasks such as biomedical\nnatural language processing, question answering, clinical document\nclassification, and medical entity recognition. It explores techniques to\nimprove the model's interpretability and validates its performance compared to\nexisting healthcare-focused language models. The paper thoroughly examines\nethical considerations, particularly patient privacy and data security. It\nhighlights the benefits of incorporating BioBERT into healthcare contexts,\nincluding enhanced clinical decision support and more efficient information\nretrieval. Nevertheless, it acknowledges the impediments and complexities of\nthis integration, encompassing concerns regarding data privacy, transparency,\nresource-intensive requirements, and the necessity for model customization to\nalign with diverse healthcare domains.\n","authors":["Shyni Sharaf","V. S. Anoop"],"pdf_url":"https://arxiv.org/pdf/2310.07282v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08102v1","updated":"2023-10-12T07:52:19Z","published":"2023-10-12T07:52:19Z","title":"QASiNa: Religious Domain Question Answering using Sirah Nabawiyah","summary":" Nowadays, Question Answering (QA) tasks receive significant research focus,\nparticularly with the development of Large Language Models (LLMs) such as Chat\nGPT [1]. LLMs can be applied to various domains, but they contradict the\nprinciples of information transmission when applied to the Islamic domain. In\nIslam, the sources of information and who may give interpretations or tafseer\nof those sources are strictly regulated [2]. The approach used by LLMs to\ngenerate answers based on their own interpretation is similar to the concept of\ntafseer, yet an LLM is neither an Islamic expert nor a human, which is not\npermitted in Islam. Indonesia is the country with the largest Islamic believer\npopulation in the world [3]. Given the high influence of LLMs, we need to\nevaluate LLMs in the religious domain. Currently, only a few religious QA\ndatasets are available, and none of them use Sirah Nabawiyah, especially in the\nIndonesian language. In this paper, we propose the Question Answering Sirah Nabawiyah\n(QASiNa) dataset, a novel dataset compiled from Sirah Nabawiyah literature in\nthe Indonesian language. We demonstrate our dataset by using mBERT [4], XLM-R [5],\nand IndoBERT [6], which are fine-tuned with the Indonesian translation of SQuAD v2.0\n[7]. The XLM-R model returned the best performance on QASiNa with an EM of 61.20,\nan F1-Score of 75.94, and a Substring Match of 70.00. We compare the XLM-R performance\nwith Chat GPT-3.5 and GPT-4 [1]. Both Chat GPT versions returned lower EM and\nF1-Score with higher Substring Match, and the gap between EM and Substring Match gets\nwider in GPT-4. 
The experiments indicate that Chat GPT tends to give excessive\ninterpretations, as evidenced by its higher Substring Match scores compared to\nEM and F1-Score, even after providing instruction and context. We conclude that\nChat GPT is unsuitable for the question answering task in the religious domain,\nespecially for the Islamic religion.\n","authors":["Muhammad Razif Rizqullah","Ayu Purwarianti","Alham Fikri Aji"],"pdf_url":"https://arxiv.org/pdf/2310.08102v1.pdf","comment":"6 Pages. In Proceeding of 10th International Conference on Advanced\n Informatics: Concepts, Theory and Applications (ICAICTA 2023)"},{"id":"http://arxiv.org/abs/2310.08101v1","updated":"2023-10-12T07:51:43Z","published":"2023-10-12T07:51:43Z","title":"Promptor: A Conversational and Autonomous Prompt Generation Agent for\n Intelligent Text Entry Techniques","summary":" Text entry is an essential task in our day-to-day digital interactions.\nNumerous intelligent features have been developed to streamline this process,\nmaking text entry more effective, efficient, and fluid. These improvements\ninclude sentence prediction and user personalization. However, as deep\nlearning-based language models become the norm for these advanced features, the\nnecessity for data collection and model fine-tuning increases. These challenges\ncan be mitigated by harnessing the in-context learning capability of large\nlanguage models such as GPT-3.5. This unique feature allows the language model\nto acquire new skills through prompts, eliminating the need for data collection\nand fine-tuning. Consequently, large language models can learn various text\nprediction techniques. We initially showed that, for a sentence prediction\ntask, merely prompting GPT-3.5 surpassed a GPT-2 backed system and is\ncomparable with a fine-tuned GPT-3.5 model, with the latter two methods\nrequiring costly data collection, fine-tuning and post-processing. However, the\ntask of prompting large language models to specialize in specific text\nprediction tasks can be challenging, particularly for designers without\nexpertise in prompt engineering. To address this, we introduce Promptor, a\nconversational prompt generation agent designed to engage proactively with\ndesigners. Promptor can automatically generate complex prompts tailored to meet\nspecific needs, thus offering a solution to this challenge. We conducted a user\nstudy involving 24 participants creating prompts for three intelligent text\nentry tasks; half of the participants used Promptor while the other half\ndesigned prompts themselves. The results show that Promptor-designed prompts\nresult in a 35% increase in similarity and a 22% increase in coherence over those by\ndesigners.\n","authors":["Junxiao Shen","John J. Dudley","Jingyao Zheng","Bill Byrne","Per Ola Kristensson"],"pdf_url":"https://arxiv.org/pdf/2310.08101v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08099v1","updated":"2023-10-12T07:48:50Z","published":"2023-10-12T07:48:50Z","title":"ClimateNLP: Analyzing Public Sentiment Towards Climate Change Using\n Natural Language Processing","summary":" Climate change's impact on human health poses unprecedented and diverse\nchallenges. Unless proactive measures based on solid evidence are implemented,\nthese threats will likely escalate and continue to endanger human well-being.\nThe escalating advancements in information and communication technologies have\nfacilitated the widespread availability and utilization of social media\nplatforms. 
Individuals utilize platforms such as Twitter and Facebook to\nexpress their opinions, thoughts, and critiques on diverse subjects,\nencompassing the pressing issue of climate change. The proliferation of climate\nchange-related content on social media necessitates comprehensive analysis to\nglean meaningful insights. This paper employs natural language processing (NLP)\ntechniques to analyze climate change discourse and quantify the sentiment of\nclimate change-related tweets. We use ClimateBERT, a pretrained model\nfine-tuned specifically for the climate change domain. The objective is to\ndiscern the sentiment individuals express and uncover patterns in public\nopinion concerning climate change. Analyzing tweet sentiments allows a deeper\ncomprehension of public perceptions, concerns, and emotions about this critical\nglobal challenge. The findings from this experiment unearth valuable insights\ninto public sentiment and the entities associated with climate change\ndiscourse. Policymakers, researchers, and organizations can leverage such\nanalyses to understand public perceptions, identify influential actors, and\ndevise informed strategies to address climate change challenges.\n","authors":["Ajay Krishnan T. K.","V. S. Anoop"],"pdf_url":"https://arxiv.org/pdf/2310.08099v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08085v1","updated":"2023-10-12T07:17:17Z","published":"2023-10-12T07:17:17Z","title":"Low-Resource Clickbait Spoiling for Indonesian via Question Answering","summary":" Clickbait spoiling aims to generate a short text to satisfy the curiosity\ninduced by a clickbait post. As it is a newly introduced task, the dataset is\nonly available in English so far. Our contributions include the construction of\na manually labeled clickbait spoiling corpus in Indonesian and an evaluation on\nusing cross-lingual zero-shot question answering-based models to tackle\nclickbait spoiling for a low-resource language like Indonesian. We utilize a\nselection of multilingual language models. The experimental results suggest\nthat the XLM-RoBERTa (large) model outperforms other models for phrase and passage\nspoilers, while the mDeBERTa (base) model outperforms other models for\nmultipart spoilers.\n","authors":["Ni Putu Intan Maharani","Ayu Purwarianti","Alham Fikri Aji"],"pdf_url":"https://arxiv.org/pdf/2310.08085v1.pdf","comment":"Accepted in ICAICTA 2023 (10th International Conference on Advanced\n Informatics: Concepts, Theory and Applications)"},{"id":"http://arxiv.org/abs/2310.08078v1","updated":"2023-10-12T06:59:10Z","published":"2023-10-12T06:59:10Z","title":"To token or not to token: A Comparative Study of Text Representations\n for Cross-Lingual Transfer","summary":" Choosing an appropriate tokenization scheme is often a bottleneck in\nlow-resource cross-lingual transfer. To understand the downstream implications\nof text representation choices, we perform a comparative analysis on language\nmodels having diverse text representation modalities including 2\nsegmentation-based models (\\texttt{BERT}, \\texttt{mBERT}), 1 image-based model\n(\\texttt{PIXEL}), and 1 character-level model (\\texttt{CANINE}). First, we\npropose a scoring Language Quotient (LQ) metric capable of providing a weighted\nrepresentation of both zero-shot and few-shot evaluation combined. Utilizing\nthis metric, we perform experiments comprising 19 source languages and 133\ntarget languages on three tasks (POS tagging, Dependency parsing, and NER). 
Our\nanalysis reveals that image-based models excel in cross-lingual transfer when\nlanguages are closely related and share visually similar scripts. However, for\ntasks biased toward word meaning (POS, NER), segmentation-based models prove to\nbe superior. Furthermore, in dependency parsing tasks where word relationships\nplay a crucial role, character-level models outperform\nothers. Finally, we propose a recommendation scheme based on our findings to\nguide model selection according to task and language requirements.\n","authors":["Md Mushfiqur Rahman","Fardin Ahsan Sakib","Fahim Faisal","Antonios Anastasopoulos"],"pdf_url":"https://arxiv.org/pdf/2310.08078v1.pdf","comment":"Accepted at 3RD MULTILINGUAL REPRESENTATION LEARNING (MRL) WORKSHOP,\n 2023"},{"id":"http://arxiv.org/abs/2310.08072v1","updated":"2023-10-12T06:46:07Z","published":"2023-10-12T06:46:07Z","title":"Training Generative Question-Answering on Synthetic Data Obtained from\n an Instruct-tuned Model","summary":" This paper presents a simple and cost-effective method for synthesizing data\nto train question-answering systems. For training, fine-tuning GPT models is a\ncommon practice in resource-rich languages like English; however, it becomes\nchallenging for non-English languages due to the scarcity of sufficient\nquestion-answer (QA) pairs. Existing approaches use question and answer\ngenerators trained on human-authored QA pairs, which involves substantial human\nexpenses. In contrast, we use an instruct-tuned model to generate QA pairs in a\nzero-shot or few-shot manner. We conduct experiments to compare various\nstrategies for obtaining QA pairs from the instruct-tuned model. The results\ndemonstrate that a model trained on our proposed synthetic data achieves\ncomparable performance to a model trained on manually curated datasets, without\nincurring human costs.\n","authors":["Kosuke Takahashi","Takahiro Omi","Kosuke Arima","Tatsuya Ishigaki"],"pdf_url":"https://arxiv.org/pdf/2310.08072v1.pdf","comment":"PACLIC 2023 short paper, 4 pages (6 pages including references), 4\n figures"},{"id":"http://arxiv.org/abs/2305.07375v4","updated":"2023-10-12T06:42:25Z","published":"2023-05-12T10:54:13Z","title":"Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation","summary":" Causal reasoning ability is crucial for numerous NLP applications. Despite\nthe impressive emerging ability of ChatGPT in various NLP tasks, it is unclear\nhow well ChatGPT performs in causal reasoning. In this paper, we conduct the\nfirst comprehensive evaluation of ChatGPT's causal reasoning capabilities.\nExperiments show that ChatGPT is not a good causal reasoner, but a good causal\nexplainer. Besides, ChatGPT suffers from serious hallucination on causal reasoning,\npossibly due to the reporting biases between causal and non-causal\nrelationships in natural language, as well as ChatGPT's upgrading processes,\nsuch as RLHF. The In-Context Learning (ICL) and Chain-of-Thought (CoT)\ntechniques can further exacerbate such causal hallucination. Additionally, the\ncausal reasoning ability of ChatGPT is sensitive to the words used to express\nthe causal concept in prompts, and close-ended prompts perform better than\nopen-ended prompts. 
For events in sentences, ChatGPT excels at capturing\nexplicit causality rather than implicit causality, and performs better in\nsentences with lower event density and smaller lexical distance between events.\nThe code is available on https://github.com/ArrogantL/ChatGPT4CausalReasoning .\n","authors":["Jinglong Gao","Xiao Ding","Bing Qin","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2305.07375v4.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08069v1","updated":"2023-10-12T06:32:42Z","published":"2023-10-12T06:32:42Z","title":"Rethinking Negative Pairs in Code Search","summary":" Recently, contrastive learning has become a key component in fine-tuning code\nsearch models for software development efficiency and effectiveness. It pulls\ntogether positive code snippets while pushing negative samples away given\nsearch queries. Among contrastive learning, InfoNCE is the most widely used\nloss function due to its better performance. However, the following problems in\nnegative samples of InfoNCE may deteriorate its representation learning: 1) The\nexistence of false negative samples in large code corpora due to duplications.\n2). The failure to explicitly differentiate between the potential relevance of\nnegative samples. As an example, a bubble sorting algorithm example is less\n``negative'' than a file saving function for the quick sorting algorithm query.\nIn this paper, we tackle the above problems by proposing a simple yet effective\nSoft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss\nfunction, we apply three methods to estimate the weights of negative pairs and\nshow that the vanilla InfoNCE loss is a special case of Soft-InfoNCE.\nTheoretically, we analyze the effects of Soft-InfoNCE on controlling the\ndistribution of learnt code representations and on deducing a more precise\nmutual information estimation. We furthermore discuss the superiority of\nproposed loss functions with other design alternatives. Extensive experiments\ndemonstrate the effectiveness of Soft-InfoNCE and weights estimation methods\nunder state-of-the-art code search models on a large-scale public dataset\nconsisting of six programming languages. Source code is available at\n\\url{https://github.com/Alex-HaochenLi/Soft-InfoNCE}.\n","authors":["Haochen Li","Xin Zhou","Luu Anh Tuan","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2310.08069v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14069v2","updated":"2023-10-12T06:20:42Z","published":"2023-05-23T13:48:32Z","title":"Evaluating Factual Consistency of Summaries with Large Language Models","summary":" Detecting factual errors in summaries has been an important and challenging\nsubject in summarization research. Inspired by the emergent ability of large\nlanguage models (LLMs), we explore evaluating factual consistency of summaries\nby directly prompting LLMs. We present a comprehensive empirical study to\nassess the ability of LLMs as factual consistency evaluators, which consists of\n(1) analyzing different LLMs such as the GPT model series and Flan-T5; (2)\ninvestigating a variety of prompting methods including vanilla prompting,\nchain-of-thought prompting, and a sentence-by-sentence prompting method to\ntackle long summaries; and (3) evaluating on diverse summaries generated by\nmultiple summarization systems, ranging from pre-transformer methods to SOTA\npretrained models. 
Our experiments demonstrate that prompting LLMs is able to\noutperform the previous best factuality systems in all settings, by up to 12.2\nabsolute points in terms of the binary classification accuracy on inconsistency\ndetection.\n","authors":["Shiqi Chen","Siyang Gao","Junxian He"],"pdf_url":"https://arxiv.org/pdf/2305.14069v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08041v1","updated":"2023-10-12T05:25:49Z","published":"2023-10-12T05:25:49Z","title":"QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large\n Language Models","summary":" Large Language Models (LLMs) excel in NLP, but their demands hinder their\nwidespread deployment. While Quantization-Aware Training (QAT) offers a\nsolution, its extensive training costs make Post-Training Quantization (PTQ) a\nmore practical approach for LLMs. In existing studies, activation outliers in\nparticular channels are identified as the bottleneck to PTQ accuracy. They\npropose to transform the magnitudes from activations to weights, which however\noffers limited alleviation or suffers from unstable gradients, resulting in a\nsevere performance drop at low-bitwidth. In this paper, we propose QLLM, an\naccurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM\nintroduces an adaptive channel reassembly technique that reallocates the\nmagnitude of outliers to other channels, thereby mitigating their impact on the\nquantization range. This is achieved by channel disassembly and channel\nassembly, which first breaks down the outlier channels into several\nsub-channels to ensure a more balanced distribution of activation magnitudes.\nThen similar channels are merged to maintain the original channel number for\nefficiency. Additionally, an adaptive strategy is designed to autonomously\ndetermine the optimal number of sub-channels for channel disassembly. To\nfurther compensate for the performance loss caused by quantization, we propose\nan efficient tuning method that only learns a small number of low-rank weights\nwhile freezing the pre-trained quantized model. After training, these low-rank\nparameters can be fused into the frozen weights without affecting inference.\nExtensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate\nquantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B\nwithin 10 hours on a single A100-80G GPU, outperforming the previous\nstate-of-the-art method by 7.89% on the average accuracy across five zero-shot\ntasks.\n","authors":["Jing Liu","Ruihao Gong","Xiuying Wei","Zhiwei Dong","Jianfei Cai","Bohan Zhuang"],"pdf_url":"https://arxiv.org/pdf/2310.08041v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08027v1","updated":"2023-10-12T04:14:28Z","published":"2023-10-12T04:14:28Z","title":"Exploring Large Language Models for Multi-Modal Out-of-Distribution\n Detection","summary":" Out-of-distribution (OOD) detection is essential for reliable and trustworthy\nmachine learning. Recent multi-modal OOD detection leverages textual\ninformation from in-distribution (ID) class names for visual OOD detection, yet\nit currently neglects the rich contextual information of ID classes. Large\nlanguage models (LLMs) encode a wealth of world knowledge and can be prompted\nto generate descriptive features for each class. Indiscriminately using such\nknowledge causes catastrophic damage to OOD detection due to LLMs'\nhallucinations, as is observed by our analysis. 
In this paper, we propose to\napply world knowledge to enhance OOD detection performance through selective\ngeneration from LLMs. Specifically, we introduce a consistency-based\nuncertainty calibration method to estimate the confidence score of each\ngeneration. We further extract visual objects from each image to fully\ncapitalize on the aforementioned world knowledge. Extensive experiments\ndemonstrate that our method consistently outperforms the state-of-the-art.\n","authors":["Yi Dai","Hao Lang","Kaisheng Zeng","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2310.08027v1.pdf","comment":"EMNLP2023 Findings Long Paper"},{"id":"http://arxiv.org/abs/2309.10691v2","updated":"2023-10-12T04:07:56Z","published":"2023-09-19T15:25:42Z","title":"MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language\n Feedback","summary":" To solve complex tasks, large language models (LLMs) often require multiple\nrounds of interactions with the user, sometimes assisted by external tools.\nHowever, current evaluation protocols often emphasize benchmark performance\nwith single-turn exchanges, neglecting the nuanced interactions among the user,\nLLMs, and external tools, while also underestimating the importance of natural\nlanguage feedback from users. These oversights contribute to discrepancies\nbetween research benchmark evaluations and real-world use cases. We introduce\nMINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn\ninteractions by (1) using tools and (2) leveraging natural language feedback.\nTo ensure reproducibility, we provide an evaluation framework where LLMs can\naccess tools by executing Python code and receive users' natural language\nfeedback simulated by GPT-4. We repurpose a diverse set of established\nevaluation datasets focusing on reasoning, coding, and decision-making and\ncarefully curate them into a compact subset for efficient evaluation. Our\nanalysis of 20 open- and closed-source LLMs offers intriguing findings. (a)\nLLMs generally benefit from tools and language feedback, with performance gains\n(absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural\nlanguage feedback. (b) Better single-turn performance does not guarantee better\nmulti-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised\ninstruction-finetuning (SIFT) and reinforcement learning from human feedback\n(RLHF) generally hurt multi-turn capabilities. We expect MINT can help measure\nprogress and incentivize research in improving LLMs' capabilities in multi-turn\ninteractions, especially for open-source communities where multi-turn human\nevaluation can be less accessible compared to commercial LLMs with a larger\nuser base.\n","authors":["Xingyao Wang","Zihan Wang","Jiateng Liu","Yangyi Chen","Lifan Yuan","Hao Peng","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2309.10691v2.pdf","comment":"Code is available on our project website:\n https://xingyaoww.github.io/mint-bench"},{"id":"http://arxiv.org/abs/2310.08017v1","updated":"2023-10-12T03:33:06Z","published":"2023-10-12T03:33:06Z","title":"Harnessing Large Language Models' Empathetic Response Generation\n Capabilities for Online Mental Health Counselling Support","summary":" Large Language Models (LLMs) have demonstrated remarkable performance across\nvarious information-seeking and reasoning tasks. These computational systems\ndrive state-of-the-art dialogue systems, such as ChatGPT and Bard. 
They also\ncarry substantial promise in meeting the growing demands of mental health care,\nalbeit relatively unexplored. As such, this study sought to examine LLMs'\ncapability to generate empathetic responses in conversations that emulate those\nin a mental health counselling setting. We selected five LLMs: version 3.5 and\nversion 4 of the Generative Pre-training (GPT), Vicuna FastChat-T5, Pathways\nLanguage Model (PaLM) version 2, and Falcon-7B-Instruct. Based on a simple\ninstructional prompt, these models responded to utterances derived from the\nEmpatheticDialogues (ED) dataset. Using three empathy-related metrics, we\ncompared their responses to those from traditional response generation dialogue\nsystems, which were fine-tuned on the ED dataset, along with human-generated\nresponses. Notably, we discovered that responses from the LLMs were remarkably\nmore empathetic in most scenarios. We position our findings in light of\ncatapulting advancements in creating empathetic conversational systems.\n","authors":["Siyuan Brandon Loh","Aravind Sesagiri Raamkumar"],"pdf_url":"https://arxiv.org/pdf/2310.08017v1.pdf","comment":"7 pages, 1 figure"},{"id":"http://arxiv.org/abs/2310.07644v2","updated":"2023-10-12T03:32:32Z","published":"2023-10-11T16:40:57Z","title":"Rethinking the BERT-like Pretraining for DNA Sequences","summary":" With the success of large-scale pretraining in NLP, there is an increasing\ntrend of applying it to the domain of life sciences. In particular, pretraining\nmethods based on DNA sequences have garnered growing attention due to their\npotential to capture generic information about genes. However, existing\npretraining methods for DNA sequences largely rely on direct adoptions of BERT\npretraining from NLP, lacking a comprehensive understanding and a specifically\ntailored approach. To address this research gap, we first conducted a series of\nexploratory experiments and gained several insightful observations: 1) In the\nfine-tuning phase of downstream tasks, when using K-mer overlapping\ntokenization instead of K-mer non-overlapping tokenization, both overlapping\nand non-overlapping pretraining weights show consistent performance\nimprovement.2) During the pre-training process, using K-mer overlapping\ntokenization quickly produces clear K-mer embeddings and reduces the loss to a\nvery low level, while using K-mer non-overlapping tokenization results in less\ndistinct embeddings and continuously decreases the loss. 3) Using overlapping\ntokenization causes the self-attention in the intermediate layers of\npre-trained models to tend to overly focus on certain tokens, reflecting that\nthese layers are not adequately optimized. In summary, overlapping tokenization\ncan benefit the fine-tuning of downstream tasks but leads to inadequate\npretraining with fast convergence. To unleash the pretraining potential, we\nintroduce a novel approach called RandomMask, which gradually increases the\ntask difficulty of BERT-like pretraining by continuously expanding its mask\nboundary, forcing the model to learn more knowledge. 
RandomMask is simple but\neffective, achieving top-tier performance on 26 of 28 datasets\nspanning 7 downstream tasks.\n","authors":["Chaoqi Liang","Weiqiang Bai","Lifeng Qiao","Yuchen Ren","Jianle Sun","Peng Ye","Hongliang Yan","Xinzhu Ma","Wangmeng Zuo","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.07644v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06488v2","updated":"2023-10-12T03:23:40Z","published":"2023-10-10T09:57:17Z","title":"SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural\n Network","summary":" Spiking neural networks (SNNs) have demonstrated the capability to achieve\ncomparable performance to deep neural networks (DNNs) in both visual and\nlinguistic domains while offering the advantages of improved energy efficiency\nand adherence to biological plausibility. However, the extension of such\nsingle-modality SNNs into the realm of multimodal scenarios remains an\nunexplored territory. Drawing inspiration from the concept of contrastive\nlanguage-image pre-training (CLIP), we introduce a novel framework, named\nSpikeCLIP, to address the gap between two modalities within the context of\nspike-based computing through a two-step recipe involving ``Alignment\nPre-training + Dual-Loss Fine-tuning\". Extensive experiments demonstrate that\nSNNs achieve comparable results to their DNN counterparts while significantly\nreducing energy consumption across a variety of datasets commonly used for\nmultimodal model evaluation. Furthermore, SpikeCLIP maintains robust\nperformance in image classification tasks that involve class labels not\npredefined within specific categories.\n","authors":["Tianlong Li","Wenhao Liu","Changze Lv","Jianhan Xu","Cenyuan Zhang","Muling Wu","Xiaoqing Zheng","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.06488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.02285v2","updated":"2023-10-12T03:05:36Z","published":"2023-09-05T14:45:27Z","title":"PromptTTS 2: Describing and Generating Voices with Text Prompt","summary":" Speech conveys more information than text, as the same word can be uttered in\nvarious voices to convey diverse information. Compared to traditional\ntext-to-speech (TTS) methods relying on speech prompts (reference speech) for\nvoice variability, using text prompts (descriptions) is more user-friendly\nsince speech prompts can be hard to find or may not exist at all. TTS\napproaches based on the text prompt face two main challenges: 1) the\none-to-many problem, where not all details about voice variability can be\ndescribed in the text prompt, and 2) the limited availability of text prompt\ndatasets, where vendors and a large cost of data labeling are required to write\ntext prompts for speech. In this work, we introduce PromptTTS 2 to address\nthese challenges with a variation network to provide variability information of\nvoice not captured by text prompts, and a prompt generation pipeline to utilize\nlarge language models (LLMs) to compose high-quality text prompts.\nSpecifically, the variation network predicts the representation extracted from\nthe reference speech (which contains full information about voice variability)\nbased on the text prompt representation. 
For the prompt generation pipeline, it\ngenerates text prompts for speech with a speech language understanding model to\nrecognize voice attributes (e.g., gender, speed) from speech and a large\nlanguage model to formulate text prompts based on the recognition results.\nExperiments on a large-scale (44K hours) speech dataset demonstrate that\ncompared to the previous works, PromptTTS 2 generates voices more consistent\nwith text prompts and supports the sampling of diverse voice variability,\nthereby offering users more choices on voice generation. Additionally, the\nprompt generation pipeline produces high-quality text prompts, eliminating the\nlarge labeling cost. The demo page of PromptTTS 2 is available online.\n","authors":["Yichong Leng","Zhifang Guo","Kai Shen","Xu Tan","Zeqian Ju","Yanqing Liu","Yufei Liu","Dongchao Yang","Leying Zhang","Kaitao Song","Lei He","Xiang-Yang Li","Sheng Zhao","Tao Qin","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2309.02285v2.pdf","comment":"Demo page: https://speechresearch.github.io/prompttts2"},{"id":"http://arxiv.org/abs/2310.04948v2","updated":"2023-10-12T02:43:13Z","published":"2023-10-08T00:02:25Z","title":"TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series\n Forecasting","summary":" The past decade has witnessed significant advances in time series modeling\nwith deep learning. While achieving state-of-the-art results, the\nbest-performing architectures vary highly across applications and domains.\nMeanwhile, for natural language processing, the Generative Pre-trained\nTransformer (GPT) has demonstrated impressive performance via training one\ngeneral-purpose model across various textual datasets. It is intriguing to\nexplore whether GPT-type architectures can be effective for time series,\ncapturing the intrinsic dynamic attributes and leading to significant accuracy\nimprovements. In this paper, we propose a novel framework, TEMPO, that can\neffectively learn time series representations. We focus on utilizing two\nessential inductive biases of the time series task for pre-trained models: (i)\ndecomposition of the complex interaction between trend, seasonal and residual\ncomponents; and (ii) introducing the selection-based prompts to facilitate\ndistribution adaptation in non-stationary time series. TEMPO expands the\ncapability for dynamically modeling real-world temporal phenomena from data\nwithin diverse domains. Our experiments demonstrate the superior performance of\nTEMPO over state-of-the-art methods on a number of time series benchmark\ndatasets. This performance gain is observed not only in standard supervised\nlearning settings but also in scenarios involving previously unseen datasets as\nwell as in scenarios with multi-modal inputs. This compelling finding\nhighlights TEMPO's potential to constitute a foundational model-building\nframework.\n","authors":["Defu Cao","Furong Jia","Sercan O Arik","Tomas Pfister","Yixiang Zheng","Wen Ye","Yan Liu"],"pdf_url":"https://arxiv.org/pdf/2310.04948v2.pdf","comment":"35 pages, 20 figures, 17 tables"},{"id":"http://arxiv.org/abs/2302.12461v2","updated":"2023-10-12T02:29:09Z","published":"2023-02-24T05:26:08Z","title":"Analyzing And Editing Inner Mechanisms Of Backdoored Language Models","summary":" Poisoning of data sets is a potential security threat to large language\nmodels that can lead to backdoored models. 
A description of the internal\nmechanisms of backdoored language models and how they process trigger inputs,\ne.g., when switching to toxic language, has yet to be found. In this work, we\nstudy the internal representations of transformer-based backdoored language\nmodels and determine early-layer MLP modules as most important for the backdoor\nmechanism in combination with the initial embedding projection. We use this\nknowledge to remove, insert, and modify backdoor mechanisms with engineered\nreplacements that reduce the MLP module outputs to essentials for the backdoor\nmechanism. To this end, we introduce PCP ablation, where we replace transformer\nmodules with low-rank matrices based on the principal components of their\nactivations. We demonstrate our results on backdoored toy, backdoored large,\nand non-backdoored open-source models. We show that we can improve the backdoor\nrobustness of large language models by locally constraining individual modules\nduring fine-tuning on potentially poisonous data sets.\n Trigger warning: Offensive language.\n","authors":["Max Lamparth","Anka Reuel"],"pdf_url":"https://arxiv.org/pdf/2302.12461v2.pdf","comment":"included new experimental results and addressed reviewer feedback"},{"id":"http://arxiv.org/abs/2310.07284v2","updated":"2023-10-12T01:40:37Z","published":"2023-10-11T08:17:54Z","title":"Typing to Listen at the Cocktail Party: Text-Guided Target Speaker\n Extraction","summary":" Humans possess an extraordinary ability to selectively focus on the sound\nsource of interest amidst complex acoustic environments, commonly referred to\nas cocktail party scenarios. In an attempt to replicate this remarkable\nauditory attention capability in machines, target speaker extraction (TSE)\nmodels have been developed. These models leverage the pre-registered cues of\nthe target speaker to extract the sound source of interest. However, the\neffectiveness of these models is hindered in real-world scenarios due to the\nunreliable or even absence of pre-registered cues. To address this limitation,\nthis study investigates the integration of natural language description to\nenhance the feasibility, controllability, and performance of existing TSE\nmodels. Specifically, we propose a model named LLM-TSE, wherein a large\nlanguage model (LLM) to extract useful semantic cues from the user's typed text\ninput. These cues can serve as independent extraction cues, task selectors to\ncontrol the TSE process, or complement the pre-registered cues. Our\nexperimental results demonstrate competitive performance when only text-based\ncues are presented, the effectiveness of using input text as a task selector,\nand a new state-of-the-art when combining text-based cues with pre-registered\ncues. To our knowledge, this is the first study to successfully incorporate\nLLMs to guide target speaker extraction, which can be a cornerstone for\ncocktail party problem research.\n","authors":["Xiang Hao","Jibin Wu","Jianwei Yu","Chenglin Xu","Kay Chen Tan"],"pdf_url":"https://arxiv.org/pdf/2310.07284v2.pdf","comment":"Under review, https://github.com/haoxiangsnr/llm-tse"},{"id":"http://arxiv.org/abs/2310.07968v1","updated":"2023-10-12T01:17:56Z","published":"2023-10-12T01:17:56Z","title":"Think, Act, and Ask: Open-World Interactive Personalized Robot\n Navigation","summary":" Zero-Shot Object Navigation (ZSON) enables agents to navigate towards\nopen-vocabulary objects in unknown environments. 
The existing works of ZSON\nmainly focus on following individual instructions to find generic object\nclasses, neglecting the utilization of natural language interaction and the\ncomplexities of identifying user-specific objects. To address these\nlimitations, we introduce Zero-shot Interactive Personalized Object Navigation\n(ZIPON), where robots need to navigate to personalized goal objects while\nengaging in conversations with users. To solve ZIPON, we propose a new\nframework termed Open-woRld Interactive persOnalized Navigation (ORION), which\nuses Large Language Models (LLMs) to make sequential decisions to manipulate\ndifferent modules for perception, navigation and communication. Experimental\nresults show that the performance of interactive agents that can leverage user\nfeedback exhibits significant improvement. However, obtaining a good balance\nbetween task completion and the efficiency of navigation and interaction\nremains challenging for all methods. We further provide more findings on the\nimpact of diverse user feedback forms on the agents' performance.\n","authors":["Yinpei Dai","Run Peng","Sikai Li","Joyce Chai"],"pdf_url":"https://arxiv.org/pdf/2310.07968v1.pdf","comment":"Video available at https://www.youtube.com/watch?v=QW6rMHVpxUY"},{"id":"http://arxiv.org/abs/2310.07659v2","updated":"2023-10-12T01:08:39Z","published":"2023-10-11T17:00:29Z","title":"Well Begun is Half Done: Generator-agnostic Knowledge Pre-Selection for\n Knowledge-Grounded Dialogue","summary":" Accurate knowledge selection is critical in knowledge-grounded dialogue\nsystems. Towards a closer look at it, we offer a novel perspective to organize\nexisting literature, i.e., knowledge selection coupled with, after, and before\ngeneration. We focus on the third under-explored category of study, which can\nnot only select knowledge accurately in advance, but has the advantage to\nreduce the learning, adjustment, and interpretation burden of subsequent\nresponse generation models, especially LLMs. We propose GATE, a\ngenerator-agnostic knowledge selection method, to prepare knowledge for\nsubsequent response generation models by selecting context-related knowledge\namong different knowledge structures and variable knowledge requirements.\nExperimental results demonstrate the superiority of GATE, and indicate that\nknowledge selection before generation is a lightweight yet effective way to\nfacilitate LLMs (e.g., ChatGPT) to generate more informative responses.\n","authors":["Lang Qin","Yao Zhang","Hongru Liang","Jun Wang","Zhenglu Yang"],"pdf_url":"https://arxiv.org/pdf/2310.07659v2.pdf","comment":"Accepted by EMNLP2023 main conference"},{"id":"http://arxiv.org/abs/2310.01889v3","updated":"2023-10-12T01:00:09Z","published":"2023-10-03T08:44:50Z","title":"Ring Attention with Blockwise Transformers for Near-Infinite Context","summary":" Transformers have emerged as the architecture of choice for many\nstate-of-the-art AI models, showcasing exceptional performance across a wide\nrange of AI applications. However, the memory demands imposed by Transformers\nlimit their ability to handle long sequences, thereby creating challenges for\ntasks involving extended sequences or long-term dependencies. We present a\ndistinct approach, Ring Attention, which leverages blockwise computation of\nself-attention to distribute long sequences across multiple devices while\noverlapping the communication of key-value blocks with the computation of\nblockwise attention. 
Ring Attention enables training and inference of sequences\nthat are up to device count times longer than those of prior memory-efficient\nTransformers, effectively eliminating the memory constraints imposed by\nindividual devices. Extensive experiments on language modeling tasks\ndemonstrate the effectiveness of Ring Attention in allowing large sequence\ninput size and improving performance.\n","authors":["Hao Liu","Matei Zaharia","Pieter Abbeel"],"pdf_url":"https://arxiv.org/pdf/2310.01889v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07962v1","updated":"2023-10-12T00:57:32Z","published":"2023-10-12T00:57:32Z","title":"Clustering of Spell Variations for Proper Nouns Transliterated from the\n other languages","summary":" One of the prominent problems with processing and operating on text data is\nthe non uniformity of it. Due to the change in the dialects and languages, the\ncaliber of translation is low. This creates a unique problem while using NLP in\ntext data; which is the spell variation arising from the inconsistent\ntranslations and transliterations. This problem can also be further aggravated\nby the human error arising from the various ways to write a Proper Noun from an\nIndian language into its English equivalent. Translating proper nouns\noriginating from Indian languages can be complicated as some proper nouns are\nalso used as common nouns which might be taken literally. Applications of NLP\nthat require addresses, names and other proper nouns face this problem\nfrequently. We propose a method to cluster these spell variations for proper\nnouns using ML techniques and mathematical similarity equations. We aimed to\nuse Affinity Propagation to determine relative similarity between the tokens.\nThe results are augmented by filtering the token-variation pair by a similarity\nthreshold. We were able to reduce the spell variations by a considerable\namount. This application can significantly reduce the amount of human\nannotation efforts needed for data cleansing and formatting.\n","authors":["Prathamesh Pawar"],"pdf_url":"https://arxiv.org/pdf/2310.07962v1.pdf","comment":"3 pages, published Airial Conference 2023"},{"id":"http://arxiv.org/abs/2310.07957v1","updated":"2023-10-12T00:50:24Z","published":"2023-10-12T00:50:24Z","title":"A New Approach Towards Autoformalization","summary":" Verifying mathematical proofs is difficult, but can be automated with the\nassistance of a computer. Autoformalization is the task of automatically\ntranslating natural language mathematics into a formal language that can be\nverified by a program. This is a challenging task, and especially for\nhigher-level mathematics found in research papers. Research paper mathematics\nrequires large amounts of background and context. In this paper, we propose an\navenue towards tackling autoformalization for research-level mathematics, by\nbreaking the task into easier and more approachable subtasks: unlinked\nformalization (formalization with unlinked definitions and theorems), entity\nlinking (linking to the proper theorems and definitions), and finally adjusting\ntypes so it passes the type checker. In addition, we present arXiv2Formal, a\nbenchmark dataset for unlinked formalization consisting of 50 theorems\nformalized for the Lean theorem prover sampled from papers on arXiv.org. 
We\nwelcome any contributions from the community to future versions of this\ndataset.\n","authors":["Nilay Patel","Jeffrey Flanigan","Rahul Saha"],"pdf_url":"https://arxiv.org/pdf/2310.07957v1.pdf","comment":"Under review at MATHAI 2023 @ NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08764v1","updated":"2023-10-12T23:17:56Z","published":"2023-10-12T23:17:56Z","title":"Calibrating Likelihoods towards Consistency in Summarization Models","summary":" Despite the recent advances in abstractive text summarization, current\nsummarization models still suffer from generating factually inconsistent\nsummaries, reducing their utility for real-world application. We argue that the\nmain reason for such behavior is that the summarization models trained with\nmaximum likelihood objective assign high probability to plausible sequences\ngiven the context, but they often do not accurately rank sequences by their\nconsistency. In this work, we solve this problem by calibrating the likelihood\nof model generated sequences to better align with a consistency metric measured\nby natural language inference (NLI) models. The human evaluation study and\nautomatic metrics show that the calibrated models generate more consistent and\nhigher-quality summaries. We also show that the models trained using our method\nreturn probabilities that are better aligned with the NLI scores, which\nsignificantly increase reliability of summarization models.\n","authors":["Polina Zablotskaia","Misha Khalman","Rishabh Joshi","Livio Baldini Soares","Shoshana Jakobovits","Joshua Maynez","Shashi Narayan"],"pdf_url":"https://arxiv.org/pdf/2310.08764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08753v1","updated":"2023-10-12T22:43:38Z","published":"2023-10-12T22:43:38Z","title":"CompA: Addressing the Gap in Compositional Reasoning in Audio-Language\n Models","summary":" A fundamental characteristic of audio is its compositional nature.\nAudio-language models (ALMs) trained using a contrastive approach (e.g., CLAP)\nthat learns a shared representation between audio and language modalities have\nimproved performance in many downstream applications, including zero-shot audio\nclassification, audio retrieval, etc. However, the ability of these models to\neffectively perform compositional reasoning remains largely unexplored and\nnecessitates additional research. In this paper, we propose CompA, a collection\nof two expert-annotated benchmarks with a majority of real-world audio samples,\nto evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates\nhow well an ALM understands the order or occurrence of acoustic events in\naudio, and CompA-attribute evaluates attribute binding of acoustic events. An\ninstance from either benchmark consists of two audio-caption pairs, where both\naudios have the same acoustic events but with different compositions. An ALM is\nevaluated on how well it matches the right audio to the right caption. Using\nthis benchmark, we first show that current ALMs perform only marginally better\nthan random chance, thereby struggling with compositional reasoning. Next, we\npropose CompA-CLAP, where we fine-tune CLAP using a novel learning method to\nimprove its compositional reasoning abilities. To train CompA-CLAP, we first\npropose improvements to contrastive training with composition-aware hard\nnegatives, allowing for more focused training. 
Next, we propose a novel modular\ncontrastive loss that helps the model learn fine-grained compositional\nunderstanding and overcomes the acute scarcity of openly available\ncompositional audios. CompA-CLAP significantly improves over all our baseline\nmodels on the CompA benchmark, indicating its superior compositional reasoning\ncapabilities.\n","authors":["Sreyan Ghosh","Ashish Seth","Sonal Kumar","Utkarsh Tyagi","Chandra Kiran Evuru","S. Ramaneswaran","S. Sakshi","Oriol Nieto","Ramani Duraiswami","Dinesh Manocha"],"pdf_url":"https://arxiv.org/pdf/2310.08753v1.pdf","comment":"Pre-print under review"},{"id":"http://arxiv.org/abs/2310.03328v2","updated":"2023-10-12T22:14:52Z","published":"2023-10-05T05:55:06Z","title":"Reformulating Domain Adaptation of Large Language Models as\n Adapt-Retrieve-Revise","summary":" While large language models (LLMs) like GPT-4 have recently demonstrated\nastonishing zero-shot capabilities in general domain tasks, they often generate\ncontent with hallucinations in specific domains such as Chinese law, hindering\ntheir application in these areas. This is typically due to the absence of\ntraining data that encompasses such a specific domain, preventing GPT-4 from\nacquiring in-domain knowledge. A pressing challenge is that it's not plausible\nto continue training LLMs of such scale on in-domain data.\n This paper introduces a simple and effective domain adaptation framework for\nGPT-4 by reformulating generation as an \\textbf{adapt-retrieve-revise} process.\nThe initial step is to \\textbf{adapt} an affordable 7B LLM to the target domain\nby continuing learning on in-domain data. When solving a task, we leverage the\nadapted LLM to generate a draft answer given a task query. Then, the draft\nanswer will be used to \\textbf{retrieve} supporting evidence candidates from an\nexternal in-domain knowledge base. Finally, the draft answer and retrieved\nevidence are concatenated into a whole prompt to let GPT-4 assess the evidence\nand \\textbf{revise} the draft answer to generate the final answer.\n Our proposal combines the advantages of the efficiency of adapting a smaller\n7B model with the evidence-assessing capability of GPT-4 and effectively\nprevents GPT-4 from generating hallucinatory content. In the zero-shot setting\nof four Chinese legal tasks, our method improves accuracy by 33.3\\% compared to\nthe direct generation by GPT-4. When compared to two stronger retrieval-based\nbaselines, our method outperforms them by 15.4\\% and 23.9\\%. Our code will be\nreleased\n","authors":["Zhen wan","Yating Zhang","Yexiang Wang","Fei Cheng","Sadao Kurohashi"],"pdf_url":"https://arxiv.org/pdf/2310.03328v2.pdf","comment":"Under submission to ICLR 2024"},{"id":"http://arxiv.org/abs/2310.08744v1","updated":"2023-10-12T22:12:28Z","published":"2023-10-12T22:12:28Z","title":"Circuit Component Reuse Across Tasks in Transformer Language Models","summary":" Recent work in mechanistic interpretability has shown that behaviors in\nlanguage models can be successfully reverse-engineered through circuit\nanalysis. A common criticism, however, is that each circuit is task-specific,\nand thus such analysis cannot contribute to understanding the models at a\nhigher level. In this work, we present evidence that insights (both low-level\nfindings about specific heads and higher-level findings about general\nalgorithms) can indeed generalize across tasks. Specifically, we study the\ncircuit discovered in Wang et al. (2022) for the Indirect Object Identification\n(IOI) task and 1.) 
show that it reproduces on a larger GPT2 model, and 2.) that\nit is mostly reused to solve a seemingly different task: Colored Objects\n(Ippolito & Callison-Burch, 2023). We provide evidence that the process\nunderlying both tasks is functionally very similar, and contains about a 78%\noverlap in in-circuit attention heads. We further present a proof-of-concept\nintervention experiment, in which we adjust four attention heads in middle\nlayers in order to 'repair' the Colored Objects circuit and make it behave like\nthe IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the\nColored Objects task and explain most sources of error. The intervention\naffects downstream attention heads in specific ways predicted by their\ninteractions in the IOI circuit, indicating that this subcircuit behavior is\ninvariant to the different task inputs. Overall, our results provide evidence\nthat it may yet be possible to explain large language models' behavior in terms\nof a relatively small number of interpretable task-general algorithmic building\nblocks and computational components.\n","authors":["Jack Merullo","Carsten Eickhoff","Ellie Pavlick"],"pdf_url":"https://arxiv.org/pdf/2310.08744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14600v2","updated":"2023-10-12T21:58:06Z","published":"2023-05-24T00:46:02Z","title":"Learning Semantic Role Labeling from Compatible Label Sequences","summary":" Semantic role labeling (SRL) has multiple disjoint label sets, e.g., VerbNet\nand PropBank. Creating these datasets is challenging, therefore a natural\nquestion is how to use each one to help the other. Prior work has shown that\ncross-task interaction helps, but only explored multitask learning so far. A\ncommon issue with multi-task setup is that argument sequences are still\nseparately decoded, running the risk of generating structurally inconsistent\nlabel sequences (as per lexicons like Semlink). In this paper, we eliminate\nsuch issue with a framework that jointly models VerbNet and PropBank labels as\none sequence. In this setup, we show that enforcing Semlink constraints during\ndecoding constantly improves the overall F1. With special input constructions,\nour joint model infers VerbNet arguments from given PropBank arguments with\nover 99 F1. For learning, we propose a constrained marginal model that learns\nwith knowledge defined in Semlink to further benefit from the large amounts of\nPropBank-only data. On the joint benchmark based on CoNLL05, our models achieve\nstate-of-the-art F1's, outperforming the prior best in-domain model by 3.5\n(VerbNet) and 0.8 (PropBank). For out-of-domain generalization, our models\nsurpass the prior best by 3.4 (VerbNet) and 0.2 (PropBank).\n","authors":["Tao Li","Ghazaleh Kazeminejad","Susan W. Brown","Martha Palmer","Vivek Srikumar"],"pdf_url":"https://arxiv.org/pdf/2305.14600v2.pdf","comment":"Accepted at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08740v1","updated":"2023-10-12T21:53:37Z","published":"2023-10-12T21:53:37Z","title":"A Zero-Shot Language Agent for Computer Control with Structured\n Reflection","summary":" Large language models (LLMs) have shown increasing capacity at planning and\nexecuting a high-level goal in a live computer environment (e.g. MiniWoB++). To\nperform a task, recent works often require a model to learn from trace examples\nof the task via either supervised learning or few/many-shot prompting. 
Without\nthese trace examples, it remains a challenge how an agent can autonomously\nlearn and improve its control on a computer, which limits the ability of an\nagent to perform a new task. We approach this problem with a zero-shot agent\nthat requires no given expert traces. Our agent plans for executable actions on\na partially observed environment, and iteratively progresses a task by\nidentifying and learning from its mistakes via self-reflection and structured\nthought management. On the easy tasks of MiniWoB++, we show that our zero-shot\nagent often outperforms recent SoTAs, with more efficient reasoning. For tasks\nwith more complexity, our reflective agent performs on par with prior best\nmodels, even though previous works had the advantages of accessing expert\ntraces or additional screen information.\n","authors":["Tao Li","Gang Li","Zhiwei Deng","Bryan Wang","Yang Li"],"pdf_url":"https://arxiv.org/pdf/2310.08740v1.pdf","comment":"Accepted at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.16130v2","updated":"2023-10-12T21:43:18Z","published":"2023-05-25T15:04:01Z","title":"A Mechanism for Solving Relational Tasks in Transformer Language Models","summary":" A primary criticism towards language models (LMs) is their inscrutability.\nThis paper presents evidence that, despite their size and complexity, LMs\nsometimes exploit a simple computational mechanism to solve one-to-one\nrelational tasks (e.g., capital_of(Poland)=Warsaw). We investigate a range of\nlanguage model sizes (from 124M parameters to 176B parameters) in an in-context\nlearning setting, and find that for a variety of tasks (involving capital\ncities, upper-casing, and past-tensing) a key part of the mechanism reduces to\na simple linear update typically applied by the feedforward (FFN) networks.\nThese updates also tend to promote the output of the relation in a\ncontent-independent way (e.g., encoding Poland:Warsaw::China:Beijing),\nrevealing a predictable pattern that these models take in solving these tasks.\nWe further show that this mechanism is specific to tasks that require retrieval\nfrom pretraining memory, rather than retrieval from local context. Our results\ncontribute to a growing body of work on the mechanistic interpretability of\nLLMs, and offer reason to be optimistic that, despite the massive and\nnon-linear nature of the models, the strategies they ultimately use to solve\ntasks can sometimes reduce to familiar and even intuitive algorithms.\n","authors":["Jack Merullo","Carsten Eickhoff","Ellie Pavlick"],"pdf_url":"https://arxiv.org/pdf/2305.16130v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.08730v2","updated":"2023-10-12T21:28:02Z","published":"2023-09-15T19:31:40Z","title":"MusiLingo: Bridging Music and Text with Pre-trained Language Models for\n Music Captioning and Query Response","summary":" Large Language Models (LLMs) have shown immense potential in multimodal\napplications, yet the convergence of textual and musical domains remains\nrelatively unexplored. To address this gap, we present MusiLingo, a novel\nsystem for music caption generation and music-related query responses.\nMusiLingo employs a single projection layer to align music representations from\nthe pre-trained frozen music audio model MERT with the frozen Vicuna-7B\nlanguage model (an adaption of LLaMA), bridging the gap between music audio and\ntextual contexts. We train it on an extensive music caption dataset and\nfine-tune it with instructional data. 
Due to the scarcity of high-quality music\nQ\\&A datasets, we created the Music Instruct (MI) dataset from captions in the\nMusicCaps datasets, tailored for open-ended music inquiries. Empirical\nevaluations demonstrate its competitive performance in generating music\ncaptions and composing music-related Q&A pairs.\n","authors":["Zihao Deng","Yinghao Ma","Yudong Liu","Rongchen Guo","Ge Zhang","Wenhu Chen","Wenhao Huang","Emmanouil Benetos"],"pdf_url":"https://arxiv.org/pdf/2309.08730v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04711v3","updated":"2023-10-12T21:25:15Z","published":"2023-08-09T05:06:39Z","title":"Answering Unseen Questions With Smaller Language Models Using Rationale\n Generation and Dense Retrieval","summary":" When provided with sufficient explanatory context, smaller Language Models\nhave been shown to exhibit strong reasoning ability on challenging short-answer\nquestion-answering tasks where the questions are unseen in training. We\nevaluate two methods for further improvement in this setting. Both methods\nfocus on combining rationales generated by a larger Language Model with longer\ncontexts created from a multi-hop dense retrieval system. The first method\n($\\textit{RR}$) involves training a Rationale Ranking model to score both\ngenerated rationales and retrieved contexts with respect to relevance and\ntruthfulness. We then use the scores to derive combined contexts from both\nknowledge sources using a number of combinatory strategies. For the second\nmethod ($\\textit{RATD}$) we utilise retrieval-augmented training datasets\ndeveloped by Hartill et al. 2023 to train a smaller Reasoning model such that\nit becomes proficient at utilising relevant information from longer text\nsequences that may be only partially evidential and frequently contain many\nirrelevant sentences. We find that both methods significantly improve results.\nOur single best Reasoning model materially improves upon strong comparable\nprior baselines for unseen evaluation datasets (StrategyQA 58.9 $\\rightarrow$\n61.7 acc., CommonsenseQA 63.6 $\\rightarrow$ 72.7 acc., ARC-DA 31.6\n$\\rightarrow$ 52.1 F1, IIRC 25.5 $\\rightarrow$ 27.3 F1) and a version utilising\nour prior knowledge of each type of question in selecting a context combination\nstrategy does even better. Our proposed models also generally outperform direct\nprompts against much larger models (BLOOM 175B and StableVicuna 13B) in both\nfew-shot chain-of-thought and standard few-shot settings.\n","authors":["Tim Hartill","Diana Benavides-Prado","Michael Witbrock","Patricia J. Riddle"],"pdf_url":"https://arxiv.org/pdf/2308.04711v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14707v2","updated":"2023-10-12T21:13:09Z","published":"2023-05-24T04:24:16Z","title":"SciFix: Outperforming GPT3 on Scientific Factual Error Correction","summary":" Due to the prohibitively high cost of creating error correction datasets,\nmost Factual Claim Correction methods rely on a powerful verification model to\nguide the correction process. This leads to a significant drop in performance\nin domains like scientific claims, where good verification models do not always\nexist. 
In this work, we introduce SciFix, a scientific claim correction system\nthat does not require a verifier but can outperform existing methods by a\nconsiderable margin -- achieving correction accuracy of 84% on the SciFact\ndataset, 77% on SciFact-Open and 72% on the CovidFact dataset, compared to next\nbest accuracies of 7%, 5%, and 15% on the same datasets respectively. Our\nmethod leverages the power of prompting with LLMs during training to create a\nrichly annotated dataset that can be used for fully supervised training and\nregularization. We additionally use a claim-aware decoding procedure to improve\nthe quality of corrected claims. Our method outperforms the very LLM that was\nused to generate the annotated dataset -- with Few-Shot Prompting on GPT3.5\nachieving 58%, 61%, and 64% on the respective datasets, a consistently lower\ncorrection accuracy, despite using nearly 800 times as many parameters as our\nmodel.\n","authors":["Dhananjay Ashok","Atharva Kulkarni","Hai Pham","Barnabás Póczos"],"pdf_url":"https://arxiv.org/pdf/2305.14707v2.pdf","comment":"To appear in proceedings of EMNLP2023 (findings)"},{"id":"http://arxiv.org/abs/2310.08715v1","updated":"2023-10-12T20:53:39Z","published":"2023-10-12T20:53:39Z","title":"Toward Joint Language Modeling for Speech Units and Text","summary":" Speech and text are two major forms of human language. The research community\nhas been focusing on mapping speech to text or vice versa for many years.\nHowever, in the field of language modeling, very little effort has been made to\nmodel them jointly. In light of this, we explore joint language modeling for\nspeech units and text. Specifically, we compare different speech tokenizers to\ntransform continuous speech signals into discrete units and use different\nmethods to construct mixed speech-text data. We introduce automatic metrics to\nevaluate how well the joint LM mixes speech and text. We also fine-tune the LM\non downstream spoken language understanding (SLU) tasks with different\nmodalities (speech or text) and test its performance to assess the model's\nlearning of shared representations. Our results show that by mixing speech\nunits and text with our proposed mixing techniques, the joint LM improves over\na speech-only baseline on SLU tasks and shows zero-shot cross-modal\ntransferability.\n","authors":["Ju-Chieh Chou","Chung-Ming Chien","Wei-Ning Hsu","Karen Livescu","Arun Babu","Alexis Conneau","Alexei Baevski","Michael Auli"],"pdf_url":"https://arxiv.org/pdf/2310.08715v1.pdf","comment":"EMNLP findings 2023"},{"id":"http://arxiv.org/abs/2310.01248v2","updated":"2023-10-12T20:10:37Z","published":"2023-10-02T14:32:07Z","title":"Improving Emotional Expression and Cohesion in Image-Based Playlist\n Description and Music Topics: A Continuous Parameterization Approach","summary":" Text generation in image-based platforms, particularly for music-related\ncontent, requires precise control over text styles and the incorporation of\nemotional expression. However, existing approaches often need help to control\nthe proportion of external factors in generated text and rely on discrete\ninputs, lacking continuous control conditions for desired text generation. This\nstudy proposes Continuous Parameterization for Controlled Text Generation\n(CPCTG) to overcome these limitations. Our approach leverages a Language Model\n(LM) as a style learner, integrating Semantic Cohesion (SC) and Emotional\nExpression Proportion (EEP) considerations. 
By enhancing the reward method and\nmanipulating the CPCTG level, our experiments on playlist description and music\ntopic generation tasks demonstrate significant improvements in ROUGE scores,\nindicating enhanced relevance and coherence in the generated text.\n","authors":["Yuelyu Ji","Yuheng Song","Wei Wang","Ruoyi Xu","Zhongqian Xie","Huiyun Liu"],"pdf_url":"https://arxiv.org/pdf/2310.01248v2.pdf","comment":"Becasue I find some important fourmulation need to change"},{"id":"http://arxiv.org/abs/2310.08678v1","updated":"2023-10-12T19:28:57Z","published":"2023-10-12T19:28:57Z","title":"Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4\n on mock CFA Exams","summary":" Large Language Models (LLMs) have demonstrated remarkable performance on a\nwide range of Natural Language Processing (NLP) tasks, often matching or even\nbeating state-of-the-art task-specific models. This study aims at assessing the\nfinancial reasoning capabilities of LLMs. We leverage mock exam questions of\nthe Chartered Financial Analyst (CFA) Program to conduct a comprehensive\nevaluation of ChatGPT and GPT-4 in financial analysis, considering Zero-Shot\n(ZS), Chain-of-Thought (CoT), and Few-Shot (FS) scenarios. We present an\nin-depth analysis of the models' performance and limitations, and estimate\nwhether they would have a chance at passing the CFA exams. Finally, we outline\ninsights into potential strategies and improvements to enhance the\napplicability of LLMs in finance. In this perspective, we hope this work paves\nthe way for future studies to continue enhancing LLMs for financial reasoning\nthrough rigorous evaluation.\n","authors":["Ethan Callanan","Amarachi Mbakwe","Antony Papadimitriou","Yulong Pei","Mathieu Sibue","Xiaodan Zhu","Zhiqiang Ma","Xiaomo Liu","Sameena Shah"],"pdf_url":"https://arxiv.org/pdf/2310.08678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.09849v5","updated":"2023-10-12T19:02:38Z","published":"2022-12-19T20:46:43Z","title":"Dataless Knowledge Fusion by Merging Weights of Language Models","summary":" Fine-tuning pre-trained language models has become the prevalent paradigm for\nbuilding downstream NLP models. Oftentimes fine-tuned models are readily\navailable but their training data is not, due to data privacy or intellectual\nproperty concerns. This creates a barrier to fusing knowledge across individual\nmodels to yield a better single model. In this paper, we study the problem of\nmerging individual models built on different training data sets to obtain a\nsingle model that performs well both across all data set domains and can\ngeneralize on out-of-domain data. We propose a dataless knowledge fusion method\nthat merges models in their parameter space, guided by weights that minimize\nprediction differences between the merged model and the individual models. Over\na battery of evaluation settings, we show that the proposed method\nsignificantly outperforms baselines such as Fisher-weighted averaging or model\nensembling. Further, we find that our method is a promising alternative to\nmulti-task learning that can preserve or sometimes improve over the individual\nmodels without access to the training data. 
Finally, model merging is more\nefficient than training a multi-task model, thus making it applicable to a\nwider set of scenarios.\n","authors":["Xisen Jin","Xiang Ren","Daniel Preotiuc-Pietro","Pengxiang Cheng"],"pdf_url":"https://arxiv.org/pdf/2212.09849v5.pdf","comment":"ICLR 2023; The code is available at\n https://github.com/bloomberg/dataless-model-merging"},{"id":"http://arxiv.org/abs/1806.07722v3","updated":"2023-10-12T18:51:54Z","published":"2018-06-15T09:29:05Z","title":"Stylized innovation: generating timelines by interrogating incrementally\n available randomised dictionaries","summary":" A key challenge when trying to understand innovation is that it is a dynamic,\nongoing process, which can be highly contingent on ephemeral factors such as\nculture, economics, or luck. This means that any analysis of the real-world\nprocess must necessarily be historical - and thus probably too late to be most\nuseful - but also cannot be sure what the properties of the web of connections\nbetween innovations is or was. Here I try to address this by designing and\ngenerating a set of synthetic innovation web \"dictionaries\" that can be used to\nhost sampled innovation timelines, probe the overall statistics and behaviours\nof these processes, and determine the degree of their reliance on the structure\nor generating algorithm. Thus, inspired by the work of Fink, Reeves, Palma and\nFarr (2017) on innovation in language, gastronomy, and technology, I study how\nnew symbol discovery manifests itself in terms of additional \"word\" vocabulary\nbeing available from dictionaries generated from a finite number of symbols.\nSeveral distinct dictionary generation models are investigated using numerical\nsimulation, with emphasis on the scaling of knowledge as dictionary generators\nand parameters are varied, and the role of which order the symbols are\ndiscovered in.\n","authors":["Paul Kinsler"],"pdf_url":"https://arxiv.org/pdf/1806.07722v3.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2302.04863v3","updated":"2023-10-12T18:42:34Z","published":"2023-02-09T18:59:18Z","title":"Knowledge is a Region in Weight Space for Fine-tuned Language Models","summary":" Research on neural networks has focused on understanding a single model\ntrained on a single dataset. However, relatively little is known about the\nrelationships between different models, particularly those trained or tested on\ndifferent datasets. We address this by studying how the weight space and the\nunderlying loss landscape of different models are interconnected.\n Specifically, we demonstrate that finetuned models that were optimized for\nhigh performance, reside in well-defined regions in weight space, and vice\nversa -- that any model that resides anywhere in those regions also exhibits\nhigh performance. Notably, we show that language models that have been\nfinetuned on the same dataset form a tight cluster in the weight space, while\nmodels finetuned on different datasets from the same underlying task form a\nlooser cluster. Moreover, traversing around the region between the models leads\nto new models that perform comparably or even better than models obtained via\nfinetuning, even on tasks that the original models were not finetuned on.\n Our findings provide insight into the relationships between models,\ndemonstrating that a model positioned between two similar models can acquire\nthe knowledge of both. We leverage this and design a method for selecting a\nbetter model for efficient finetuning. 
Specifically, we show that starting from\nthe center of the region is as effective, if not more, than using the\npretrained model in 11 out of 12 datasets, resulting in an average accuracy\nimprovement of 3.06.\n","authors":["Almog Gueta","Elad Venezian","Colin Raffel","Noam Slonim","Yoav Katz","Leshem Choshen"],"pdf_url":"https://arxiv.org/pdf/2302.04863v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04668v2","updated":"2023-10-12T18:34:08Z","published":"2023-10-07T03:14:11Z","title":"Label-free Node Classification on Graphs with Large Language Models\n (LLMS)","summary":" In recent years, there have been remarkable advancements in node\nclassification achieved by Graph Neural Networks (GNNs). However, they\nnecessitate abundant high-quality labels to ensure promising performance. In\ncontrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency\non text-attributed graphs. Yet, they face challenges in efficiently processing\nstructural data and suffer from high inference costs. In light of these\nobservations, this work introduces a label-free node classification on graphs\nwith LLMs pipeline, LLM-GNN. It amalgamates the strengths of both GNNs and LLMs\nwhile mitigating their limitations. Specifically, LLMs are leveraged to\nannotate a small portion of nodes and then GNNs are trained on LLMs'\nannotations to make predictions for the remaining large portion of nodes. The\nimplementation of LLM-GNN faces a unique challenge: how can we actively select\nnodes for LLMs to annotate and consequently enhance the GNN training? How can\nwe leverage LLMs to obtain annotations of high quality, representativeness, and\ndiversity, thereby enhancing GNN performance with less cost? To tackle this\nchallenge, we develop an annotation quality heuristic and leverage the\nconfidence scores derived from LLMs to advanced node selection. Comprehensive\nexperimental results validate the effectiveness of LLM-GNN. In particular,\nLLM-GNN can achieve an accuracy of 74.9% on a vast-scale dataset \\products with\na cost less than 1 dollar.\n","authors":["Zhikai Chen","Haitao Mao","Hongzhi Wen","Haoyu Han","Wei Jin","Haiyang Zhang","Hui Liu","Jiliang Tang"],"pdf_url":"https://arxiv.org/pdf/2310.04668v2.pdf","comment":"The code will be available soon via\n https://github.com/CurryTang/LLMGNN"},{"id":"http://arxiv.org/abs/2310.08659v1","updated":"2023-10-12T18:34:08Z","published":"2023-10-12T18:34:08Z","title":"LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models","summary":" Quantization is an indispensable technique for serving Large Language Models\n(LLMs) and has recently found its way into LoRA fine-tuning. In this work we\nfocus on the scenario where quantization and LoRA fine-tuning are applied\ntogether on a pre-trained model. In such cases it is common to observe a\nconsistent gap in the performance on downstream tasks between full fine-tuning\nand quantization plus LoRA fine-tuning approach. In response, we propose LoftQ\n(LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that\nsimultaneously quantizes an LLM and finds a proper low-rank initialization for\nLoRA fine-tuning. Such an initialization alleviates the discrepancy between the\nquantized and full-precision model and significantly improves the\ngeneralization in downstream tasks. We evaluate our method on natural language\nunderstanding, question answering, summarization, and natural language\ngeneration tasks. 
Experiments show that our method is highly effective and\noutperforms existing quantization methods, especially in the challenging 2-bit\nand 2/4-bit mixed precision regimes. We will release our code.\n","authors":["Yixiao Li","Yifan Yu","Chen Liang","Pengcheng He","Nikos Karampatziakis","Weizhu Chen","Tuo Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08659v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05280v3","updated":"2023-10-12T18:28:09Z","published":"2023-10-08T21:03:18Z","title":"Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona\n Biases in Dialogue Systems","summary":" Recent advancements in Large Language Models empower them to follow freeform\ninstructions, including imitating generic or specific demographic personas in\nconversations. Generic personas refer to an individual from a demographic group\n(e.g. an Asian person), whereas specific personas can be actual names of\nhistorical figures. While the adoption of personas allows dialogue systems to\nbe more engaging and approachable to users, it also carries the potential risk\nof exacerbating social biases in model responses, further causing societal\nharms through interactions with users. In this paper, we systematically study\n\"persona biases\", which we define to be the sensitivity of harmful dialogue\nmodel behaviors to different persona adoptions. We categorize persona biases\ninto biases in harmful expression and harmful agreement, as well as establish a\ncomprehensive evaluation framework to measure persona biases in five aspects:\nOffensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic\nAgreement. Additionally, we propose to comprehensively investigate persona\nbiases through experimenting with UniversalPersona, a systematized persona\ndataset with a comprehensive list of both generic and specific model personas.\nThrough benchmarking on four different models, including Blender, ChatGPT,\nAlpaca, and Vicuna, our study uncovers significant persona biases in these\ndialogue systems.Findings of our study underscores the immediate need to\nrevisit the use of persona traits in dialogue agents, to ensure their safe\napplication.\n","authors":["Yixin Wan","Jieyu Zhao","Aman Chadha","Nanyun Peng","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2310.05280v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05185v2","updated":"2023-10-12T18:26:46Z","published":"2023-10-08T14:47:13Z","title":"Text2NKG: Fine-Grained N-ary Relation Extraction for N-ary relational\n Knowledge Graph Construction","summary":" Beyond traditional binary relational facts, n-ary relational knowledge graphs\n(NKGs) are comprised of n-ary relational facts containing more than two\nentities, which are closer to real-world facts with broader applications.\nHowever, the construction of NKGs still significantly relies on manual labor,\nand n-ary relation extraction still remains at a course-grained level, which is\nalways in a single schema and fixed arity of entities. To address these\nrestrictions, we propose Text2NKG, a novel fine-grained n-ary relation\nextraction framework for n-ary relational knowledge graph construction. We\nintroduce a span-tuple classification approach with hetero-ordered merging to\naccomplish fine-grained n-ary relation extraction in different arity.\nFurthermore, Text2NKG supports four typical NKG schemas: hyper-relational\nschema, event-based schema, role-based schema, and hypergraph-based schema,\nwith high flexibility and practicality. 
Experimental results demonstrate that\nText2NKG outperforms the previous state-of-the-art model by nearly 20\\% points\nin the $F_1$ scores on the fine-grained n-ary relation extraction benchmark in\nthe hyper-relational schema. Our code and datasets are publicly available.\n","authors":["Haoran Luo","Haihong E","Yuhao Yang","Tianyu Yao","Yikai Guo","Zichen Tang","Wentai Zhang","Kaiyang Wan","Shiyao Peng","Meina Song","Wei Lin"],"pdf_url":"https://arxiv.org/pdf/2310.05185v2.pdf","comment":"Preprint"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.08587v1","updated":"2023-10-12T17:59:58Z","published":"2023-10-12T17:59:58Z","title":"Is Generalized Dynamic Novel View Synthesis from Monocular Videos\n Possible Today?","summary":" Rendering scenes observed in a monocular video from novel viewpoints is a\nchallenging problem. For static scenes the community has studied both\nscene-specific optimization techniques, which optimize on every test scene, and\ngeneralized techniques, which only run a deep net forward pass on a test scene.\nIn contrast, for dynamic scenes, scene-specific optimization techniques exist,\nbut, to our best knowledge, there is currently no generalized method for\ndynamic novel view synthesis from a given monocular video. To answer whether\ngeneralized dynamic novel view synthesis from monocular videos is possible\ntoday, we establish an analysis framework based on existing techniques and work\ntoward the generalized approach. We find a pseudo-generalized process without\nscene-specific appearance optimization is possible, but geometrically and\ntemporally consistent depth estimates are needed. Despite no scene-specific\nappearance optimization, the pseudo-generalized approach improves upon some\nscene-specific methods.\n","authors":["Xiaoming Zhao","Alex Colburn","Fangchang Ma","Miguel Angel Bautista","Joshua M. Susskind","Alexander G. Schwing"],"pdf_url":"https://arxiv.org/pdf/2310.08587v1.pdf","comment":"Project page: https://xiaoming-zhao.github.io/projects/pgdvs"},{"id":"http://arxiv.org/abs/2310.08588v1","updated":"2023-10-12T17:59:58Z","published":"2023-10-12T17:59:58Z","title":"Octopus: Embodied Vision-Language Programmer from Environmental Feedback","summary":" Large vision-language models (VLMs) have achieved substantial progress in\nmultimodal perception and reasoning. Furthermore, when seamlessly integrated\ninto an embodied agent, it signifies a crucial stride towards the creation of\nautonomous and context-aware systems capable of formulating plans and executing\ncommands with precision. In this paper, we introduce Octopus, a novel VLM\ndesigned to proficiently decipher an agent's vision and textual task objectives\nand to formulate intricate action sequences and generate executable code. Our\ndesign allows the agent to adeptly handle a wide spectrum of tasks, ranging\nfrom mundane daily chores in simulators to sophisticated interactions in\ncomplex video games. Octopus is trained by leveraging GPT-4 to control an\nexplorative agent to generate training data, i.e., action blueprints and the\ncorresponding executable code, within our experimental environment called\nOctoVerse. We also collect the feedback that allows the enhanced training\nscheme of Reinforcement Learning with Environmental Feedback (RLEF). Through a\nseries of experiments, we illuminate Octopus's functionality and present\ncompelling results, and the proposed RLEF turns out to refine the agent's\ndecision-making. 
By open-sourcing our model architecture, simulator, and\ndataset, we aspire to ignite further innovation and foster collaborative\napplications within the broader embodied AI community.\n","authors":["Jingkang Yang","Yuhao Dong","Shuai Liu","Bo Li","Ziyue Wang","Chencheng Jiang","Haoran Tan","Jiamu Kang","Yuanhan Zhang","Kaiyang Zhou","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08588v1.pdf","comment":"Project Page: https://choiszt.github.io/Octopus/, Codebase:\n https://github.com/dongyh20/Octopus"},{"id":"http://arxiv.org/abs/2310.08585v1","updated":"2023-10-12T17:59:57Z","published":"2023-10-12T17:59:57Z","title":"Im4D: High-Fidelity and Real-Time Novel View Synthesis for Dynamic\n Scenes","summary":" This paper aims to tackle the challenge of dynamic view synthesis from\nmulti-view videos. The key observation is that while previous grid-based\nmethods offer consistent rendering, they fall short in capturing appearance\ndetails of a complex dynamic scene, a domain where multi-view image-based\nrendering methods demonstrate the opposite properties. To combine the best of\ntwo worlds, we introduce Im4D, a hybrid scene representation that consists of a\ngrid-based geometry representation and a multi-view image-based appearance\nrepresentation. Specifically, the dynamic geometry is encoded as a 4D density\nfunction composed of spatiotemporal feature planes and a small MLP network,\nwhich globally models the scene structure and facilitates the rendering\nconsistency. We represent the scene appearance by the original multi-view\nvideos and a network that learns to predict the color of a 3D point from image\nfeatures, instead of memorizing detailed appearance totally with networks,\nthereby naturally making the learning of networks easier. Our method is\nevaluated on five dynamic view synthesis datasets including DyNeRF, ZJU-MoCap,\nNHR, DNA-Rendering and ENeRF-Outdoor datasets. The results show that Im4D\nexhibits state-of-the-art performance in rendering quality and can be trained\nefficiently, while realizing real-time rendering with a speed of 79.8 FPS for\n512x512 images, on a single RTX 3090 GPU.\n","authors":["Haotong Lin","Sida Peng","Zhen Xu","Tao Xie","Xingyi He","Hujun Bao","Xiaowei Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.08585v1.pdf","comment":"SIGGRAPH Asia 2023; Project page: https://zju3dv.github.io/im4d"},{"id":"http://arxiv.org/abs/2310.08586v1","updated":"2023-10-12T17:59:57Z","published":"2023-10-12T17:59:57Z","title":"PonderV2: Pave the Way for 3D Foundataion Model with A Universal\n Pre-training Paradigm","summary":" In contrast to numerous NLP and 2D computer vision foundational models, the\nlearning of a robust and highly generalized 3D foundational model poses\nconsiderably greater challenges. This is primarily due to the inherent data\nvariability and the diversity of downstream tasks. In this paper, we introduce\na comprehensive 3D pre-training framework designed to facilitate the\nacquisition of efficient 3D representations, thereby establishing a pathway to\n3D foundational models. Motivated by the fact that informative 3D features\nshould be able to encode rich geometry and appearance cues that can be utilized\nto render realistic images, we propose a novel universal paradigm to learn\npoint cloud representations by differentiable neural rendering, serving as a\nbridge between 3D and 2D worlds. We train a point cloud encoder within a\ndevised volumetric neural renderer by comparing the rendered images with the\nreal images. 
Notably, our approach demonstrates the seamless integration of the\nlearned 3D encoder into diverse downstream tasks. These tasks encompass not\nonly high-level challenges such as 3D detection and segmentation but also\nlow-level objectives like 3D reconstruction and image synthesis, spanning both\nindoor and outdoor scenarios. Besides, we also illustrate the capability of\npre-training a 2D backbone using the proposed universal methodology, surpassing\nconventional pre-training methods by a large margin. For the first time,\n\\sexyname achieves state-of-the-art performance on 11 indoor and outdoor\nbenchmarks. The consistent improvements in various settings imply the\neffectiveness of the proposed method. Code and models will be made available at\nhttps://github.com/Pointcept/Pointcept.\n","authors":["Haoyi Zhu","Honghui Yang","Xiaoyang Wu","Di Huang","Sha Zhang","Xianglong He","Tong He","Hengshuang Zhao","Chunhua Shen","Yu Qiao","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.08586v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2301.00157"},{"id":"http://arxiv.org/abs/2310.08584v1","updated":"2023-10-12T17:59:55Z","published":"2023-10-12T17:59:55Z","title":"Is ImageNet worth 1 video? Learning strong image encoders from 1 long\n unlabelled video","summary":" Self-supervised learning has unlocked the potential of scaling up pretraining\nto billions of images, since annotation is unnecessary. But are we making the\nbest use of data? How more economical can we be? In this work, we attempt to\nanswer this question by making two contributions. First, we investigate\nfirst-person videos and introduce a \"Walking Tours\" dataset. These videos are\nhigh-resolution, hours-long, captured in a single uninterrupted take, depicting\na large number of objects and actions with natural scene transitions. They are\nunlabeled and uncurated, thus realistic for self-supervision and comparable\nwith human learning.\n Second, we introduce a novel self-supervised image pretraining method\ntailored for learning from continuous videos. Existing methods typically adapt\nimage-based pretraining approaches to incorporate more frames. Instead, we\nadvocate a \"tracking to learn to recognize\" approach. Our method called DoRA,\nleads to attention maps that Discover and tRAck objects over time in an\nend-to-end manner, using transformer cross-attention. We derive multiple views\nfrom the tracks and use them in a classical self-supervised distillation loss.\nUsing our novel approach, a single Walking Tours video remarkably becomes a\nstrong competitor to ImageNet for several image and video downstream tasks.\n","authors":["Shashanka Venkataramanan","Mamshad Nayeem Rizve","João Carreira","Yuki M. Asano","Yannis Avrithis"],"pdf_url":"https://arxiv.org/pdf/2310.08584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08581v1","updated":"2023-10-12T17:59:41Z","published":"2023-10-12T17:59:41Z","title":"Universal Visual Decomposer: Long-Horizon Manipulation Made Easy","summary":" Real-world robotic tasks stretch over extended horizons and encompass\nmultiple stages. Learning long-horizon manipulation tasks, however, is a\nlong-standing challenge, and demands decomposing the overarching task into\nseveral manageable subtasks to facilitate policy learning and generalization to\nunseen tasks. Prior task decomposition methods require task-specific knowledge,\nare computationally intensive, and cannot readily be applied to new tasks. 
To\naddress these shortcomings, we propose Universal Visual Decomposer (UVD), an\noff-the-shelf task decomposition method for visual long horizon manipulation\nusing pre-trained visual representations designed for robotic control. At a\nhigh level, UVD discovers subgoals by detecting phase shifts in the embedding\nspace of the pre-trained representation. Operating purely on visual\ndemonstrations without auxiliary information, UVD can effectively extract\nvisual subgoals embedded in the videos, while incurring zero additional\ntraining cost on top of standard visuomotor policy training. Goal-conditioned\npolicies learned with UVD-discovered subgoals exhibit significantly improved\ncompositional generalization at test time to unseen tasks. Furthermore,\nUVD-discovered subgoals can be used to construct goal-based reward shaping that\njump-starts temporally extended exploration for reinforcement learning. We\nextensively evaluate UVD on both simulation and real-world tasks, and in all\ncases, UVD substantially outperforms baselines across imitation and\nreinforcement learning settings on in-domain and out-of-domain task sequences\nalike, validating the clear advantage of automated visual task decomposition\nwithin the simple, compact UVD framework.\n","authors":["Zichen Zhang","Yunshuang Li","Osbert Bastani","Abhishek Gupta","Dinesh Jayaraman","Yecheng Jason Ma","Luca Weihs"],"pdf_url":"https://arxiv.org/pdf/2310.08581v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08580v1","updated":"2023-10-12T17:59:38Z","published":"2023-10-12T17:59:38Z","title":"OmniControl: Control Any Joint at Any Time for Human Motion Generation","summary":" We present a novel approach named OmniControl for incorporating flexible\nspatial control signals into a text-conditioned human motion generation model\nbased on the diffusion process. Unlike previous methods that can only control\nthe pelvis trajectory, OmniControl can incorporate flexible spatial control\nsignals over different joints at different times with only one model.\nSpecifically, we propose analytic spatial guidance that ensures the generated\nmotion can tightly conform to the input control signals. At the same time,\nrealism guidance is introduced to refine all the joints to generate more\ncoherent motion. Both the spatial and realism guidance are essential and they\nare highly complementary for balancing control accuracy and motion realism. By\ncombining them, OmniControl generates motions that are realistic, coherent, and\nconsistent with the spatial constraints. Experiments on HumanML3D and KIT-ML\ndatasets show that OmniControl not only achieves significant improvement over\nstate-of-the-art methods on pelvis control but also shows promising results\nwhen incorporating the constraints over other joints.\n","authors":["Yiming Xie","Varun Jampani","Lei Zhong","Deqing Sun","Huaizu Jiang"],"pdf_url":"https://arxiv.org/pdf/2310.08580v1.pdf","comment":"Project page: https://neu-vi.github.io/omnicontrol/"},{"id":"http://arxiv.org/abs/2310.08579v1","updated":"2023-10-12T17:59:34Z","published":"2023-10-12T17:59:34Z","title":"HyperHuman: Hyper-Realistic Human Generation with Latent Structural\n Diffusion","summary":" Despite significant advances in large-scale text-to-image models, achieving\nhyper-realistic human image generation remains a desirable yet unsolved task.\nExisting models like Stable Diffusion and DALL-E 2 tend to generate human\nimages with incoherent parts or unnatural poses. 
To tackle these challenges,\nour key insight is that human image is inherently structural over multiple\ngranularities, from the coarse-level body skeleton to fine-grained spatial\ngeometry. Therefore, capturing such correlations between the explicit\nappearance and latent structure in one model is essential to generate coherent\nand natural human images. To this end, we propose a unified framework,\nHyperHuman, that generates in-the-wild human images of high realism and diverse\nlayouts. Specifically, 1) we first build a large-scale human-centric dataset,\nnamed HumanVerse, which consists of 340M images with comprehensive annotations\nlike human pose, depth, and surface normal. 2) Next, we propose a Latent\nStructural Diffusion Model that simultaneously denoises the depth and surface\nnormal along with the synthesized RGB image. Our model enforces the joint\nlearning of image appearance, spatial relationship, and geometry in a unified\nnetwork, where each branch in the model complements to each other with both\nstructural awareness and textural richness. 3) Finally, to further boost the\nvisual quality, we propose a Structure-Guided Refiner to compose the predicted\nconditions for more detailed generation of higher resolution. Extensive\nexperiments demonstrate that our framework yields the state-of-the-art\nperformance, generating hyper-realistic human images under diverse scenarios.\nProject Page: https://snap-research.github.io/HyperHuman/\n","authors":["Xian Liu","Jian Ren","Aliaksandr Siarohin","Ivan Skorokhodov","Yanyu Li","Dahua Lin","Xihui Liu","Ziwei Liu","Sergey Tulyakov"],"pdf_url":"https://arxiv.org/pdf/2310.08579v1.pdf","comment":"Project Page: https://snap-research.github.io/HyperHuman/"},{"id":"http://arxiv.org/abs/2310.08577v1","updated":"2023-10-12T17:59:30Z","published":"2023-10-12T17:59:30Z","title":"Visual Data-Type Understanding does not emerge from Scaling\n Vision-Language Models","summary":" Recent advances in the development of vision-language models (VLMs) are\nyielding remarkable success in recognizing visual semantic content, including\nimpressive instances of compositional image understanding. Here, we introduce\nthe novel task of \\textit{Visual Data-Type Identification}, a basic perceptual\nskill with implications for data curation (e.g., noisy data-removal from large\ndatasets, domain-specific retrieval) and autonomous vision (e.g.,\ndistinguishing changing weather conditions from camera lens staining). We\ndevelop two datasets consisting of animal images altered across a diverse set\nof 27 visual \\textit{data-types}, spanning four broad categories. An extensive\nzero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a\nnuanced performance landscape. While VLMs are reasonably good at identifying\ncertain stylistic \\textit{data-types}, such as cartoons and sketches, they\nstruggle with simpler \\textit{data-types} arising from basic manipulations like\nimage rotations or additive noise. Our findings reveal that (i) model scaling\nalone yields marginal gains for contrastively-trained models like CLIP, and\n(ii) there is a pronounced drop in performance for the largest\nauto-regressively trained VLMs like OpenFlamingo. This finding points to a\nblind spot in current frontier VLMs: they excel in recognizing semantic content\nbut fail to acquire an understanding of visual \\textit{data-types} through\nscaling. 
By analyzing the pre-training distributions of these models and\nincorporating \\textit{data-type} information into the captions during\nfine-tuning, we achieve a significant enhancement in performance. By exploring\nthis previously uncharted task, we aim to set the stage for further advancing\nVLMs to equip them with visual data-type understanding. Code and datasets are\nreleased \\href{https://github.com/bethgelab/DataTypeIdentification}{here}.\n","authors":["Vishaal Udandarao","Max F. Burg","Samuel Albanie","Matthias Bethge"],"pdf_url":"https://arxiv.org/pdf/2310.08577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08576v1","updated":"2023-10-12T17:59:23Z","published":"2023-10-12T17:59:23Z","title":"Learning to Act from Actionless Videos through Dense Correspondences","summary":" In this work, we present an approach to construct a video-based robot policy\ncapable of reliably executing diverse tasks across different robots and\nenvironments from few video demonstrations without using any action\nannotations. Our method leverages images as a task-agnostic representation,\nencoding both the state and action information, and text as a general\nrepresentation for specifying robot goals. By synthesizing videos that\n``hallucinate'' robot executing actions and in combination with dense\ncorrespondences between frames, our approach can infer the closed-formed action\nto execute to an environment without the need of any explicit action labels.\nThis unique capability allows us to train the policy solely based on RGB videos\nand deploy learned policies to various robotic tasks. We demonstrate the\nefficacy of our approach in learning policies on table-top manipulation and\nnavigation tasks. Additionally, we contribute an open-source framework for\nefficient video modeling, enabling the training of high-fidelity policy models\nwith four GPUs within a single day.\n","authors":["Po-Chen Ko","Jiayuan Mao","Yilun Du","Shao-Hua Sun","Joshua B. Tenenbaum"],"pdf_url":"https://arxiv.org/pdf/2310.08576v1.pdf","comment":"Project page: https://flow-diffusion.github.io/"},{"id":"http://arxiv.org/abs/2112.09726v2","updated":"2023-10-12T17:57:51Z","published":"2021-12-17T19:22:01Z","title":"Soundify: Matching Sound Effects to Video","summary":" In the art of video editing, sound helps add character to an object and\nimmerse the viewer within a space. Through formative interviews with\nprofessional editors (N=10), we found that the task of adding sounds to video\ncan be challenging. This paper presents Soundify, a system that assists editors\nin matching sounds to video. Given a video, Soundify identifies matching\nsounds, synchronizes the sounds to the video, and dynamically adjusts panning\nand volume to create spatial audio. In a human evaluation study (N=889), we\nshow that Soundify is capable of matching sounds to video out-of-the-box for a\ndiverse range of audio categories. 
In a within-subjects expert study (N=12), we\ndemonstrate the usefulness of Soundify in helping video editors match sounds to\nvideo with lighter workload, reduced task completion time, and improved\nusability.\n","authors":["David Chuan-En Lin","Anastasis Germanidis","Cristóbal Valenzuela","Yining Shi","Nikolas Martelaro"],"pdf_url":"https://arxiv.org/pdf/2112.09726v2.pdf","comment":"Full paper in UIST 2023; Short paper in NeurIPS 2021 ML4CD Workshop;\n Online demo: https://soundify.cc"},{"id":"http://arxiv.org/abs/2308.11606v2","updated":"2023-10-12T17:50:38Z","published":"2023-08-22T17:53:55Z","title":"StoryBench: A Multifaceted Benchmark for Continuous Story Visualization","summary":" Generating video stories from text prompts is a complex task. In addition to\nhaving high visual quality, videos need to realistically adhere to a sequence\nof text prompts whilst being consistent throughout the frames. Creating a\nbenchmark for video generation requires data annotated over time, which\ncontrasts with the single caption used often in video datasets. To fill this\ngap, we collect comprehensive human annotations on three existing datasets, and\nintroduce StoryBench: a new, challenging multi-task benchmark to reliably\nevaluate forthcoming text-to-video models. Our benchmark includes three video\ngeneration tasks of increasing difficulty: action execution, where the next\naction must be generated starting from a conditioning video; story\ncontinuation, where a sequence of actions must be executed starting from a\nconditioning video; and story generation, where a video must be generated from\nonly text prompts. We evaluate small yet strong text-to-video baselines, and\nshow the benefits of training on story-like data algorithmically generated from\nexisting video captions. Finally, we establish guidelines for human evaluation\nof video stories, and reaffirm the need of better automatic metrics for video\ngeneration. StoryBench aims at encouraging future research efforts in this\nexciting new area.\n","authors":["Emanuele Bugliarello","Hernan Moraldo","Ruben Villegas","Mohammad Babaeizadeh","Mohammad Taghi Saffar","Han Zhang","Dumitru Erhan","Vittorio Ferrari","Pieter-Jan Kindermans","Paul Voigtlaender"],"pdf_url":"https://arxiv.org/pdf/2308.11606v2.pdf","comment":"NeurIPS D&B 2023"},{"id":"http://arxiv.org/abs/2310.08541v1","updated":"2023-10-12T17:34:20Z","published":"2023-10-12T17:34:20Z","title":"Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic\n Image Design and Generation","summary":" We introduce ``Idea to Image,'' a system that enables multimodal iterative\nself-refinement with GPT-4V(ision) for automatic image design and generation.\nHumans can quickly identify the characteristics of different text-to-image\n(T2I) models via iterative explorations. This enables them to efficiently\nconvert their high-level generation ideas into effective T2I prompts that can\nproduce good images. We investigate if systems based on large multimodal models\n(LMMs) can develop analogous multimodal self-refinement abilities that enable\nexploring unknown models or environments via self-refining tries. Idea2Img\ncyclically generates revised T2I prompts to synthesize draft images, and\nprovides directional feedback for prompt revision, both conditioned on its\nmemory of the probed T2I model's characteristics. The iterative self-refinement\nbrings Idea2Img various advantages over vanilla T2I models. 
Notably, Idea2Img\ncan process input ideas with interleaved image-text sequences, follow ideas\nwith design instructions, and generate images of better semantic and visual\nqualities. The user preference study validates the efficacy of multimodal\niterative self-refinement on automatic image design and generation.\n","authors":["Zhengyuan Yang","Jianfeng Wang","Linjie Li","Kevin Lin","Chung-Ching Lin","Zicheng Liu","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08541v1.pdf","comment":"Project page at https://idea2img.github.io/"},{"id":"http://arxiv.org/abs/2310.08538v1","updated":"2023-10-12T17:28:06Z","published":"2023-10-12T17:28:06Z","title":"Image2PCI -- A Multitask Learning Framework for Estimating Pavement\n Condition Indices Directly from Images","summary":" The Pavement Condition Index (PCI) is a widely used metric for evaluating\npavement performance based on the type, extent and severity of distresses\ndetected on a pavement surface. In recent times, significant progress has been\nmade in utilizing deep-learning approaches to automate PCI estimation process.\nHowever, the current approaches rely on at least two separate models to\nestimate PCI values -- one model dedicated to determining the type and extent\nand another for estimating their severity. This approach presents several\nchallenges, including complexities, high computational resource demands, and\nmaintenance burdens that necessitate careful consideration and resolution. To\novercome these challenges, the current study develops a unified multi-tasking\nmodel that predicts the PCI directly from a top-down pavement image. The\nproposed architecture is a multi-task model composed of one encoder for feature\nextraction and four decoders to handle specific tasks: two detection heads, one\nsegmentation head and one PCI estimation head. By multitasking, we are able to\nextract features from the detection and segmentation heads for automatically\nestimating the PCI directly from the images. The model performs very well on\nour benchmarked and open pavement distress dataset that is annotated for\nmultitask learning (the first of its kind). To our best knowledge, this is the\nfirst work that can estimate PCI directly from an image at real time speeds\nwhile maintaining excellent accuracy on all related tasks for crack detection\nand segmentation.\n","authors":["Neema Jakisa Owor","Hang Du","Abdulateef Daud","Armstrong Aboah","Yaw Adu-Gyamfi"],"pdf_url":"https://arxiv.org/pdf/2310.08538v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08537v1","updated":"2023-10-12T17:26:16Z","published":"2023-10-12T17:26:16Z","title":"XAI Benchmark for Visual Explanation","summary":" The rise of deep learning algorithms has led to significant advancements in\ncomputer vision tasks, but their \"black box\" nature has raised concerns\nregarding interpretability. Explainable AI (XAI) has emerged as a critical area\nof research aiming to open this \"black box\", and shed light on the\ndecision-making process of AI models. Visual explanations, as a subset of\nExplainable Artificial Intelligence (XAI), provide intuitive insights into the\ndecision-making processes of AI models handling visual data by highlighting\ninfluential areas in an input image. Despite extensive research conducted on\nvisual explanations, most evaluations are model-centered since the availability\nof corresponding real-world datasets with ground truth explanations is scarce\nin the context of image data. 
To bridge this gap, we introduce an XAI Benchmark\ncomprising a dataset collection from diverse topics that provide both class\nlabels and corresponding explanation annotations for images. We have processed\ndata from diverse domains to align with our unified visual explanation\nframework. We introduce a comprehensive Visual Explanation pipeline, which\nintegrates data loading, preprocessing, experimental setup, and model\nevaluation processes. This structure enables researchers to conduct fair\ncomparisons of various visual explanation techniques. In addition, we provide a\ncomprehensive review of over 10 evaluation methods for visual explanation to\nassist researchers in effectively utilizing our dataset collection. To further\nassess the performance of existing visual explanation methods, we conduct\nexperiments on selected datasets using various model-centered and ground\ntruth-centered evaluation metrics. We envision this benchmark could facilitate\nthe advancement of visual explanation models. The XAI dataset collection and\neasy-to-use code for evaluation are publicly accessible at\nhttps://xaidataset.github.io.\n","authors":["Yifei Zhang","Siyi Gu","James Song","Bo Pan","Liang Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08537v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05173v2","updated":"2023-10-12T17:25:44Z","published":"2023-09-11T00:02:05Z","title":"DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning","summary":" Prompt tuning (PT), where a small amount of trainable soft (continuous)\nprompt vectors is affixed to the input of language models (LM), has shown\npromising results across various tasks and models for parameter-efficient\nfine-tuning (PEFT). PT stands out from other PEFT approaches because it\nmaintains competitive performance with fewer trainable parameters and does not\ndrastically scale up its parameters as the model size expands. However, PT\nintroduces additional soft prompt tokens, leading to longer input sequences,\nwhich significantly impacts training and inference time and memory usage due to\nthe Transformer's quadratic complexity. Particularly concerning for Large\nLanguage Models (LLMs) that face heavy daily querying. To address this issue,\nwe propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt\ninto a shorter soft prompt and a pair of low-rank matrices that are then\noptimised with two different learning rates. This allows DePT to achieve better\nperformance while saving over 20% memory and time costs compared to vanilla PT\nand its variants, without changing trainable parameter sizes. Through extensive\nexperiments on 23 natural language processing (NLP) and vision-language (VL)\ntasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches,\nincluding the full fine-tuning baseline in some scenarios. 
Additionally, we\nempirically show that DEPT grows more efficient as the model size increases.\nOur further study reveals that DePT integrates seamlessly with\nparameter-efficient transfer learning in the few-shot learning setting and\nhighlights its adaptability to various model architectures and sizes.\n","authors":["Zhengxiang Shi","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2309.05173v2.pdf","comment":"Code is available at https://github.com/ZhengxiangShi/DePT"},{"id":"http://arxiv.org/abs/2307.03293v2","updated":"2023-10-12T17:25:26Z","published":"2023-07-06T21:08:03Z","title":"CheXmask: a large-scale dataset of anatomical segmentation masks for\n multi-center chest x-ray images","summary":" The development of successful artificial intelligence models for chest X-ray\nanalysis relies on large, diverse datasets with high-quality annotations. While\nseveral databases of chest X-ray images have been released, most include\ndisease diagnosis labels but lack detailed pixel-level anatomical segmentation\nlabels. To address this gap, we introduce an extensive chest X-ray multi-center\nsegmentation dataset with uniform and fine-grain anatomical annotations for\nimages coming from six well-known publicly available databases: CANDID-PTX,\nChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, resulting in\n676,803 segmentation masks. Our methodology utilizes the HybridGNet model to\nensure consistent and high-quality segmentations across all datasets. Rigorous\nvalidation, including expert physician evaluation and automatic quality\ncontrol, was conducted to validate the resulting masks. Additionally, we\nprovide individualized quality indices per mask and an overall quality\nestimation per dataset. This dataset serves as a valuable resource for the\nbroader scientific community, streamlining the development and assessment of\ninnovative methodologies in chest X-ray analysis. The CheXmask dataset is\npublicly available at:\nhttps://physionet.org/content/chexmask-cxr-segmentation-data/\n","authors":["Nicolás Gaggion","Candelaria Mosquera","Lucas Mansilla","Martina Aineseder","Diego H. Milone","Enzo Ferrante"],"pdf_url":"https://arxiv.org/pdf/2307.03293v2.pdf","comment":"The CheXmask dataset is publicly available at\n https://physionet.org/content/chexmask-cxr-segmentation-data/"},{"id":"http://arxiv.org/abs/2310.08534v1","updated":"2023-10-12T17:24:05Z","published":"2023-10-12T17:24:05Z","title":"Animating Street View","summary":" We present a system that automatically brings street view imagery to life by\npopulating it with naturally behaving, animated pedestrians and vehicles. Our\napproach is to remove existing people and vehicles from the input image, insert\nmoving objects with proper scale, angle, motion, and appearance, plan paths and\ntraffic behavior, as well as render the scene with plausible occlusion and\nshadowing effects. The system achieves these by reconstructing the still image\nstreet scene, simulating crowd behavior, and rendering with consistent\nlighting, visibility, occlusions, and shadows. 
We demonstrate results on a\ndiverse range of street scenes including regular still images and panoramas.\n","authors":["Mengyi Shan","Brian Curless","Ira Kemelmacher-Shlizerman","Steve Seitz"],"pdf_url":"https://arxiv.org/pdf/2310.08534v1.pdf","comment":"SIGGRAPH Asia 2023 Conference Track"},{"id":"http://arxiv.org/abs/2310.08530v1","updated":"2023-10-12T17:22:58Z","published":"2023-10-12T17:22:58Z","title":"UniPose: Detecting Any Keypoints","summary":" This work proposes a unified framework called UniPose to detect keypoints of\nany articulated (e.g., human and animal), rigid, and soft objects via visual or\ntextual prompts for fine-grained vision understanding and manipulation.\nKeypoint is a structure-aware, pixel-level, and compact representation of any\nobject, especially articulated objects. Existing fine-grained promptable tasks\nmainly focus on object instance detection and segmentation but often fail to\nidentify fine-grained granularity and structured information of image and\ninstance, such as eyes, leg, paw, etc. Meanwhile, prompt-based keypoint\ndetection is still under-explored. To bridge the gap, we make the first attempt\nto develop an end-to-end prompt-based keypoint detection framework called\nUniPose to detect keypoints of any objects. As keypoint detection tasks are\nunified in this framework, we can leverage 13 keypoint detection datasets with\n338 keypoints across 1,237 categories over 400K instances to train a generic\nkeypoint detection model. UniPose can effectively align text-to-keypoint and\nimage-to-keypoint due to the mutual enhancement of textual and visual prompts\nbased on the cross-modality contrastive learning optimization objectives. Our\nexperimental results show that UniPose has strong fine-grained localization and\ngeneralization abilities across image styles, categories, and poses. Based on\nUniPose as a generalist keypoint detector, we hope it could serve fine-grained\nvisual perception, understanding, and generation.\n","authors":["Jie Yang","Ailing Zeng","Ruimao Zhang","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08530v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08529v1","updated":"2023-10-12T17:22:24Z","published":"2023-10-12T17:22:24Z","title":"GaussianDreamer: Fast Generation from Text to 3D Gaussian Splatting with\n Point Cloud Priors","summary":" In recent times, the generation of 3D assets from text prompts has shown\nimpressive results. Both 2D and 3D diffusion models can generate decent 3D\nobjects based on prompts. 3D diffusion models have good 3D consistency, but\ntheir quality and generalization are limited as trainable 3D data is expensive\nand hard to obtain. 2D diffusion models enjoy strong abilities of\ngeneralization and fine generation, but the 3D consistency is hard to\nguarantee. This paper attempts to bridge the power from the two types of\ndiffusion models via the recent explicit and efficient 3D Gaussian splatting\nrepresentation. A fast 3D generation framework, named as \\name, is proposed,\nwhere the 3D diffusion model provides point cloud priors for initialization and\nthe 2D diffusion model enriches the geometry and appearance. Operations of\nnoisy point growing and color perturbation are introduced to enhance the\ninitialized Gaussians. Our \\name can generate a high-quality 3D instance within\n25 minutes on one GPU, much faster than previous methods, while the generated\ninstances can be directly rendered in real time. 
Demos and code are available\nat https://taoranyi.com/gaussiandreamer/.\n","authors":["Taoran Yi","Jiemin Fang","Guanjun Wu","Lingxi Xie","Xiaopeng Zhang","Wenyu Liu","Qi Tian","Xinggang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08529v1.pdf","comment":"Work in progress. Project page: https://taoranyi.com/gaussiandreamer/"},{"id":"http://arxiv.org/abs/2310.08528v1","updated":"2023-10-12T17:21:41Z","published":"2023-10-12T17:21:41Z","title":"4D Gaussian Splatting for Real-Time Dynamic Scene Rendering","summary":" Representing and rendering dynamic scenes has been an important but\nchallenging task. Especially, to accurately model complex motions, high\nefficiency is usually hard to maintain. We introduce the 4D Gaussian Splatting\n(4D-GS) to achieve real-time dynamic scene rendering while also enjoying high\ntraining and storage efficiency. An efficient deformation field is constructed\nto model both Gaussian motions and shape deformations. Different adjacent\nGaussians are connected via a HexPlane to produce more accurate position and\nshape deformations. Our 4D-GS method achieves real-time rendering under high\nresolutions, 70 FPS at a 800$\\times$800 resolution on an RTX 3090 GPU, while\nmaintaining comparable or higher quality than previous state-of-the-art\nmethods. More demos and code are available at\nhttps://guanjunwu.github.io/4dgs/.\n","authors":["Guanjun Wu","Taoran Yi","Jiemin Fang","Lingxi Xie","Xiaopeng Zhang","Wei Wei","Wenyu Liu","Qi Tian","Xinggang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08528v1.pdf","comment":"Work in progress. Project page: https://guanjunwu.github.io/4dgs/"},{"id":"http://arxiv.org/abs/2310.08501v1","updated":"2023-10-12T16:59:50Z","published":"2023-10-12T16:59:50Z","title":"Unsupervised Learning of Object-Centric Embeddings for Cell Instance\n Segmentation in Microscopy Images","summary":" Segmentation of objects in microscopy images is required for many biomedical\napplications. We introduce object-centric embeddings (OCEs), which embed image\npatches such that the spatial offsets between patches cropped from the same\nobject are preserved. Those learnt embeddings can be used to delineate\nindividual objects and thus obtain instance segmentations. Here, we show\ntheoretically that, under assumptions commonly found in microscopy images, OCEs\ncan be learnt through a self-supervised task that predicts the spatial offset\nbetween image patches. Together, this forms an unsupervised cell instance\nsegmentation method which we evaluate on nine diverse large-scale microscopy\ndatasets. Segmentations obtained with our method lead to substantially improved\nresults, compared to state-of-the-art baselines on six out of nine datasets,\nand perform on par on the remaining three datasets. If ground-truth annotations\nare available, our method serves as an excellent starting point for supervised\ntraining, reducing the required amount of ground-truth needed by one order of\nmagnitude, thus substantially increasing the practical applicability of our\nmethod. 
Source code is available at https://github.com/funkelab/cellulus.\n","authors":["Steffen Wolf","Manan Lalit","Henry Westmacott","Katie McDole","Jan Funke"],"pdf_url":"https://arxiv.org/pdf/2310.08501v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06823v2","updated":"2023-10-12T16:42:55Z","published":"2023-10-10T17:53:36Z","title":"NECO: NEural Collapse Based Out-of-distribution detection","summary":" Detecting out-of-distribution (OOD) data is a critical challenge in machine\nlearning due to model overconfidence, often without awareness of their\nepistemological limits. We hypothesize that ``neural collapse'', a phenomenon\naffecting in-distribution data for models trained beyond loss convergence, also\ninfluences OOD data. To benefit from this interplay, we introduce NECO, a novel\npost-hoc method for OOD detection, which leverages the geometric properties of\n``neural collapse'' and of principal component spaces to identify OOD data. Our\nextensive experiments demonstrate that NECO achieves state-of-the-art results\non both small and large-scale OOD detection tasks while exhibiting strong\ngeneralization capabilities across different network architectures.\nFurthermore, we provide a theoretical explanation for the effectiveness of our\nmethod in OOD detection. We plan to release the code after the anonymity\nperiod.\n","authors":["Mouïn Ben Ammar","Nacim Belkhir","Sebastian Popescu","Antoine Manzanera","Gianni Franchi"],"pdf_url":"https://arxiv.org/pdf/2310.06823v2.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/2310.08475v1","updated":"2023-10-12T16:32:44Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights\\footnote{Code\nand dataset are available in https://github.com/zjunlp/EasyEdit.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08465v1","updated":"2023-10-12T16:26:18Z","published":"2023-10-12T16:26:18Z","title":"MotionDirector: Motion Customization of Text-to-Video Diffusion Models","summary":" Large-scale pre-trained diffusion models have exhibited remarkable\ncapabilities in diverse video generations. Given a set of video clips of the\nsame motion concept, the task of Motion Customization is to adapt existing\ntext-to-video diffusion models to generate videos with this motion. For\nexample, generating a video with a car moving in a prescribed manner under\nspecific camera movements to make a movie, or a video illustrating how a bear\nwould lift weights to inspire creators. 
Adaptation methods have been developed\nfor customizing appearance like subject or style, yet unexplored for motion. It\nis straightforward to extend mainstream adaption methods for motion\ncustomization, including full model tuning, parameter-efficient tuning of\nadditional layers, and Low-Rank Adaptions (LoRAs). However, the motion concept\nlearned by these methods is often coupled with the limited appearances in the\ntraining videos, making it difficult to generalize the customized motion to\nother appearances. To overcome this challenge, we propose MotionDirector, with\na dual-path LoRAs architecture to decouple the learning of appearance and\nmotion. Further, we design a novel appearance-debiased temporal loss to\nmitigate the influence of appearance on the temporal training objective.\nExperimental results show the proposed method can generate videos of diverse\nappearances for the customized motions. Our method also supports various\ndownstream applications, such as the mixing of different videos with their\nappearance and motion respectively, and animating a single image with\ncustomized motions. Our code and model weights will be released.\n","authors":["Rui Zhao","Yuchao Gu","Jay Zhangjie Wu","David Junhao Zhang","Jiawei Liu","Weijia Wu","Jussi Keppo","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2310.08465v1.pdf","comment":"Project Page: https://showlab.github.io/MotionDirector/"},{"id":"http://arxiv.org/abs/2310.08451v1","updated":"2023-10-12T16:11:13Z","published":"2023-10-12T16:11:13Z","title":"Proving the Potential of Skeleton Based Action Recognition to Automate\n the Analysis of Manual Processes","summary":" In manufacturing sectors such as textiles and electronics, manual processes\nare a fundamental part of production. The analysis and monitoring of the\nprocesses is necessary for efficient production design. Traditional methods for\nanalyzing manual processes are complex, expensive, and inflexible. Compared to\nestablished approaches such as Methods-Time-Measurement (MTM), machine learning\n(ML) methods promise: Higher flexibility, self-sufficient & permanent use,\nlower costs. In this work, based on a video stream, the current motion class in\na manual assembly process is detected. With information on the current motion,\nKey-Performance-Indicators (KPIs) can be derived easily. A skeleton-based\naction recognition approach is taken, as this field recently shows major\nsuccess in machine vision tasks. For skeleton-based action recognition in\nmanual assembly, no sufficient pre-work could be found. Therefore, a ML\npipeline is developed, to enable extensive research on different (pre-)\nprocessing methods and neural nets. Suitable well generalizing approaches are\nfound, proving the potential of ML to enhance analyzation of manual processes.\nModels detect the current motion, performed by an operator in manual assembly,\nbut the results can be transferred to all kinds of manual processes.\n","authors":["Marlin Berger","Frederik Cloppenburg","Jens Eufinger","Thomas Gries"],"pdf_url":"https://arxiv.org/pdf/2310.08451v1.pdf","comment":"16 pages, 6 figures. Find peer-reviewed version in Proceedings of\n IntelliSys 2023"},{"id":"http://arxiv.org/abs/2310.08442v1","updated":"2023-10-12T16:04:41Z","published":"2023-10-12T16:04:41Z","title":"Debias the Training of Diffusion Models","summary":" Diffusion models have demonstrated compelling generation quality by\noptimizing the variational lower bound through a simple denoising score\nmatching loss. 
In this paper, we provide theoretical evidence that the\nprevailing practice of using a constant loss weight strategy in diffusion\nmodels leads to biased estimation during the training phase. Simply optimizing\nthe denoising network to predict Gaussian noise with constant weighting may\nhinder precise estimations of original images. To address the issue, we propose\nan elegant and effective weighting strategy grounded in the theoretically\nunbiased principle. Moreover, we conduct a comprehensive and systematic\nexploration to dissect the inherent bias problem deriving from constant\nweighting loss from the perspectives of its existence, impact and reasons.\nThese analyses are expected to advance our understanding and demystify the\ninner workings of diffusion models. Through empirical evaluation, we\ndemonstrate that our proposed debiased estimation method significantly enhances\nsample quality without the reliance on complex techniques, and exhibits\nimproved efficiency compared to the baseline method both in training and\nsampling processes.\n","authors":["Hu Yu","Li Shen","Jie Huang","Man Zhou","Hongsheng Li","Feng Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08442v1.pdf","comment":"University of Science and Technology of China, Alibaba Group, The\n Chinese University of Hong Kong"},{"id":"http://arxiv.org/abs/2310.08430v1","updated":"2023-10-12T15:53:47Z","published":"2023-10-12T15:53:47Z","title":"Assessing of Soil Erosion Risk Through Geoinformation Sciences and\n Remote Sensing -- A Review","summary":" During past decades a marked manifestation of widespread erosion phenomena\nwas studied worldwide. Global conservation community has launched campaigns at\nlocal, regional and continental level in developing countries for preservation\nof soil resources in order not only to stop or mitigate human impact on nature\nbut also to improve life in rural areas introducing new approaches for soil\ncultivation. After the adoption of Sustainable Development Goals of UNs and\nlaunching several world initiatives such as the Land Degradation Neutrality\n(LDN) the world came to realize the very importance of the soil resources on\nwhich the biosphere relies for its existence. The main goal of the chapter is\nto review different types and structures erosion models as well as their\napplications. Several methods using spatial analysis capabilities of geographic\ninformation systems (GIS) are in operation for soil erosion risk assessment,\nsuch as Universal Soil Loss Equation (USLE), Revised Universal Soil Loss\nEquation (RUSLE) in operation worldwide and in the USA and MESALES model. These\nand more models are being discussed in the present work alongside more\nexperimental models and methods for assessing soil erosion risk such as\nArtificial Intelligence (AI), Machine and Deep Learning, etc. At the end of\nthis work, a prospectus for the future development of soil erosion risk\nassessment is drawn.\n","authors":["Lachezar Filchev","Vasil Kolev"],"pdf_url":"https://arxiv.org/pdf/2310.08430v1.pdf","comment":"Chapter 21 (pages 54)"},{"id":"http://arxiv.org/abs/2310.08429v1","updated":"2023-10-12T15:53:24Z","published":"2023-10-12T15:53:24Z","title":"Revisiting Data Augmentation for Rotational Invariance in Convolutional\n Neural Networks","summary":" Convolutional Neural Networks (CNN) offer state of the art performance in\nvarious computer vision tasks. Many of those tasks require different subtypes\nof affine invariances (scale, rotational, translational) to image\ntransformations. 
Convolutional layers are translation equivariant by design,\nbut in their basic form lack invariances. In this work we investigate how best\nto include rotational invariance in a CNN for image classification. Our\nexperiments show that networks trained with data augmentation alone can\nclassify rotated images nearly as well as in the normal unrotated case; this\nincrease in representational power comes only at the cost of training time. We\nalso compare data augmentation versus two modified CNN models for achieving\nrotational invariance or equivariance, Spatial Transformer Networks and Group\nEquivariant CNNs, finding no significant accuracy increase with these\nspecialized methods. In the case of data augmented networks, we also analyze\nwhich layers help the network to encode the rotational invariance, which is\nimportant for understanding its limitations and how to best retrain a network\nwith data augmentation to achieve invariance to rotation.\n","authors":["Facundo Manuel Quiroga","Franco Ronchetti","Laura Lanzarini","Aurelio Fernandez-Bariviera"],"pdf_url":"https://arxiv.org/pdf/2310.08429v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.05188v3","updated":"2023-10-12T15:43:01Z","published":"2022-05-10T21:55:26Z","title":"On Scale Space Radon Transform, Properties and Application in CT Image\n Reconstruction","summary":" Since the Radon transform (RT) consists in a line integral function, some\nmodeling assumptions are made on Computed Tomography (CT) system, making image\nreconstruction analytical methods, such as Filtered Backprojection (FBP),\nsensitive to artifacts and noise. In the other hand, recently, a new integral\ntransform, called Scale Space Radon Transform (SSRT), is introduced where, RT\nis a particular case. Thanks to its interesting properties, such as good scale\nspace behavior, the SSRT has known number of new applications. In this paper,\nwith the aim to improve the reconstructed image quality for these methods, we\npropose to model the X-ray beam with the Scale Space Radon Transform (SSRT)\nwhere, the assumptions done on the physical dimensions of the CT system\nelements reflect better the reality. After depicting the basic properties and\nthe inversion of SSRT, the FBP algorithm is used to reconstruct the image from\nthe SSRT sinogram where the RT spectrum used in FBP is replaced by SSRT and the\nGaussian kernel, expressed in their frequency domain. PSNR and SSIM, as quality\nmeasures, are used to compare RT and SSRT-based image reconstruction on\nShepp-Logan head and anthropomorphic abdominal phantoms. The first findings\nshow that the SSRT-based method outperforms the methods based on RT,\nespecially, when the number of projections is reduced, making it more\nappropriate for applications requiring low-dose radiation, such as medical\nX-ray CT. 
While SSRT-FBP and RT-FBP have utmost the same runtime, the\nexperiments show that SSRT-FBP is more robust to Poisson-Gaussian noise\ncorrupting CT data.\n","authors":["Nafaa Nacereddine","Djemel Ziou","Aicha Baya Goumeidane"],"pdf_url":"https://arxiv.org/pdf/2205.05188v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08421v1","updated":"2023-10-12T15:42:17Z","published":"2023-10-12T15:42:17Z","title":"\"SegLoc\": Study on Novel Visual Self-supervised Learning Scheme (Segment\n Localization) Tailored for Dense Prediction Tasks of Security Inspection\n X-ray Images","summary":" Lately, remarkable advancements of artificial intelligence have been\nattributed to the integration of self-supervised learning scheme. Despite\nimpressive achievements within NLP, yet SSL in computer vision has not been\nable to stay on track comparatively. Recently, integration of contrastive\nlearning on top of existing SSL models has established considerable progress in\ncomputer vision through which visual SSL models have outperformed their\nsupervised counterparts. Nevertheless, most of these improvements were limited\nto classification tasks, and also, few works have been dedicated to evaluation\nof SSL models in real-world scenarios of computer vision, while the majority of\nworks are centered around datasets containing class-wise portrait images, most\nnotably, ImageNet. Consequently, in this work, we have considered dense\nprediction task of semantic segmentation in security inspection x-ray images to\nevaluate our proposed model Segmentation Localization. Based upon the model\nInstance Localization, our model SegLoc has managed to address one of the most\nchallenging downsides of contrastive learning, i.e., false negative pairs of\nquery embeddings. In order to do so, in contrast to baseline model InsLoc, our\npretraining dataset is synthesized by cropping, transforming, then pasting\nalready labeled segments from an available labeled dataset, foregrounds, onto\ninstances of an unlabeled dataset, backgrounds. In our case, PIDray and SIXray\ndatasets are considered as labeled and unlabeled datasets, respectively.\nMoreover, we fully harness labels by avoiding false negative pairs through\nimplementing the idea, one queue per class, in MoCo-v2 whereby negative pairs\ncorresponding to each query are extracted from its corresponding queue within\nthe memory bank. Our approach has outperformed random initialization by 3% to\n6%, while having underperformed supervised initialization.\n","authors":["Shervin Halat","Mohammad Rahmati","Ehsan Nazerfard"],"pdf_url":"https://arxiv.org/pdf/2310.08421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08420v1","updated":"2023-10-12T15:39:54Z","published":"2023-10-12T15:39:54Z","title":"Visual Attention-Prompted Prediction and Learning","summary":" Explanation(attention)-guided learning is a method that enhances a model's\npredictive power by incorporating human understanding during the training\nphase. While attention-guided learning has shown promising results, it often\ninvolves time-consuming and computationally expensive model retraining. To\naddress this issue, we introduce the attention-prompted prediction technique,\nwhich enables direct prediction guided by the attention prompt without the need\nfor model retraining. However, this approach presents several challenges,\nincluding: 1) How to incorporate the visual attention prompt into the model's\ndecision-making process and leverage it for future predictions even in the\nabsence of a prompt? 
and 2) How to handle the incomplete information from the\nvisual attention prompt? To tackle these challenges, we propose a novel\nframework called Visual Attention-Prompted Prediction and Learning, which\nseamlessly integrates visual attention prompts into the model's decision-making\nprocess and adapts to images both with and without attention prompts for\nprediction. To address the incomplete information of the visual attention\nprompt, we introduce a perturbation-based attention map modification method.\nAdditionally, we propose an optimization-based mask aggregation method with a\nnew weight learning function for adaptive perturbed annotation aggregation in\nthe attention map modification process. Our overall framework is designed to\nlearn in an attention-prompt guided multi-task manner to enhance future\npredictions even for samples without attention prompts and trained in an\nalternating manner for better convergence. Extensive experiments conducted on\ntwo datasets demonstrate the effectiveness of our proposed framework in\nenhancing predictions for samples, both with and without provided prompts.\n","authors":["Yifei Zhang","Siyi Gu","Bo Pan","Guangji Bai","Xiaofeng Yang","Liang Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08420v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.11713v4","updated":"2023-10-12T15:30:41Z","published":"2023-02-23T00:33:54Z","title":"Can Pre-trained Vision and Language Models Answer Visual\n Information-Seeking Questions?","summary":" Pre-trained vision and language models have demonstrated state-of-the-art\ncapabilities over existing tasks involving images and texts, including visual\nquestion answering. However, it remains unclear whether these models possess\nthe capability to answer questions that are not only querying visual content\nbut knowledge-intensive and information-seeking. In this study, we introduce\nInfoSeek, a visual question answering dataset tailored for information-seeking\nquestions that cannot be answered with only common sense knowledge. Using\nInfoSeek, we analyze various pre-trained visual question answering models and\ngain insights into their characteristics. Our findings reveal that\nstate-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.)\nface challenges in answering visual information-seeking questions, but\nfine-tuning on the InfoSeek dataset elicits models to use fine-grained\nknowledge that was learned during their pre-training. 
Furthermore, we show that\naccurate visual entity recognition can be used to improve performance on\nInfoSeek by retrieving relevant documents, showing a significant space for\nimprovement.\n","authors":["Yang Chen","Hexiang Hu","Yi Luan","Haitian Sun","Soravit Changpinyo","Alan Ritter","Ming-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2302.11713v4.pdf","comment":"EMNLP 2023 (main conference); Our dataset and evaluation is available\n at https://open-vision-language.github.io/infoseek/"},{"id":"http://arxiv.org/abs/2310.08398v1","updated":"2023-10-12T15:09:12Z","published":"2023-10-12T15:09:12Z","title":"Towards Design and Development of an ArUco Markers-Based Quantitative\n Surface Tactile Sensor","summary":" In this paper, with the goal of quantifying the qualitative image outputs of\na Vision-based Tactile Sensor (VTS), we present the design, fabrication, and\ncharacterization of a novel Quantitative Surface Tactile Sensor (called QS-TS).\nQS-TS directly estimates the sensor's gel layer deformation in real-time\nenabling safe and autonomous tactile manipulation and servoing of delicate\nobjects using robotic manipulators. The core of the proposed sensor is the\nutilization of miniature 1.5 mm x 1.5 mm synthetic square markers with inner\nbinary patterns and a broad black border, called ArUco Markers. Each ArUco\nmarker can provide real-time camera pose estimation that, in our design, is\nused as a quantitative measure for obtaining deformation of the QS-TS gel\nlayer. Moreover, thanks to the use of ArUco markers, we propose a unique\nfabrication procedure that mitigates various challenges associated with the\nfabrication of the existing marker-based VTSs and offers an intuitive and\nless-arduous method for the construction of the VTS. Remarkably, the proposed\nfabrication facilitates the integration and adherence of markers with the gel\nlayer to robustly and reliably obtain a quantitative measure of deformation in\nreal-time regardless of the orientation of ArUco Markers. The performance and\nefficacy of the proposed QS-TS in estimating the deformation of the sensor's\ngel layer were experimentally evaluated and verified. Results demonstrate the\nphenomenal performance of the QS-TS in estimating the deformation of the gel\nlayer with a relative error of <5%.\n","authors":["Ozdemir Can Kara","Charles Everson","Farshid Alambeigi"],"pdf_url":"https://arxiv.org/pdf/2310.08398v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.02621v2","updated":"2023-10-12T15:05:15Z","published":"2023-04-05T17:43:57Z","title":"High-fidelity Pseudo-labels for Boosting Weakly-Supervised Segmentation","summary":" Image-level weakly-supervised semantic segmentation (WSSS) reduces the\nusually vast data annotation cost by surrogate segmentation masks during\ntraining. The typical approach involves training an image classification\nnetwork using global average pooling (GAP) on convolutional feature maps. This\nenables the estimation of object locations based on class activation maps\n(CAMs), which identify the importance of image regions. The CAMs are then used\nto generate pseudo-labels, in the form of segmentation masks, to supervise a\nsegmentation model in the absence of pixel-level ground truth. Our work is\nbased on two techniques for improving CAMs; importance sampling, which is a\nsubstitute for GAP, and the feature similarity loss, which utilizes a heuristic\nthat object contours almost always align with color edges in images. 
However,\nboth are based on the multinomial posterior with softmax, and implicitly assume\nthat classes are mutually exclusive, which turns out suboptimal in our\nexperiments. Thus, we reformulate both techniques based on binomial posteriors\nof multiple independent binary problems. This has two benefits; their\nperformance is improved and they become more general, resulting in an add-on\nmethod that can boost virtually any WSSS method. This is demonstrated on a wide\nvariety of baselines on the PASCAL VOC dataset, improving the region similarity\nand contour quality of all implemented state-of-the-art methods. Experiments on\nthe MS COCO dataset show that our proposed add-on is well-suited for\nlarge-scale settings. Our code is available at https://github.com/arvijj/hfpl.\n","authors":["Arvi Jonnarth","Yushan Zhang","Michael Felsberg"],"pdf_url":"https://arxiv.org/pdf/2304.02621v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.04946v2","updated":"2023-10-12T15:04:30Z","published":"2023-09-10T06:33:17Z","title":"Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation","summary":" Audio-driven talking-head synthesis is a popular research topic for virtual\nhuman-related applications. However, the inflexibility and inefficiency of\nexisting methods, which necessitate expensive end-to-end training to transfer\nemotions from guidance videos to talking-head predictions, are significant\nlimitations. In this work, we propose the Emotional Adaptation for Audio-driven\nTalking-head (EAT) method, which transforms emotion-agnostic talking-head\nmodels into emotion-controllable ones in a cost-effective and efficient manner\nthrough parameter-efficient adaptations. Our approach utilizes a pretrained\nemotion-agnostic talking-head transformer and introduces three lightweight\nadaptations (the Deep Emotional Prompts, Emotional Deformation Network, and\nEmotional Adaptation Module) from different perspectives to enable precise and\nrealistic emotion controls. Our experiments demonstrate that our approach\nachieves state-of-the-art performance on widely-used benchmarks, including LRW\nand MEAD. Additionally, our parameter-efficient adaptations exhibit remarkable\ngeneralization ability, even in scenarios where emotional training videos are\nscarce or nonexistent. Project website: https://yuangan.github.io/eat/\n","authors":["Yuan Gan","Zongxin Yang","Xihang Yue","Lingyun Sun","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2309.04946v2.pdf","comment":"Accepted to ICCV 2023. Project page: https://yuangan.github.io/eat/"},{"id":"http://arxiv.org/abs/2310.08390v1","updated":"2023-10-12T15:00:06Z","published":"2023-10-12T15:00:06Z","title":"Hyp-UML: Hyperbolic Image Retrieval with Uncertainty-aware Metric\n Learning","summary":" Metric learning plays a critical role in training image retrieval and\nclassification. It is also a key algorithm in representation learning, e.g.,\nfor feature learning and its alignment in metric space. Hyperbolic embedding\nhas been recently developed, compared to the conventional Euclidean embedding\nin most of the previously developed models, and can be more effective in\nrepresenting the hierarchical data structure. Second, uncertainty\nestimation/measurement is a long-lasting challenge in artificial intelligence.\nSuccessful uncertainty estimation can improve a machine learning model's\nperformance, robustness, and security. In Hyperbolic space, uncertainty\nmeasurement is at least with equivalent, if not more, critical importance. 
In\nthis paper, we develop a Hyperbolic image embedding with uncertainty-aware\nmetric learning for image retrieval. We call our method Hyp-UML: Hyperbolic\nUncertainty-aware Metric Learning. Our contribution are threefold: we propose\nan image embedding algorithm based on Hyperbolic space, with their\ncorresponding uncertainty value; we propose two types of uncertainty-aware\nmetric learning, for the popular Contrastive learning and conventional\nmargin-based metric learning, respectively. We perform extensive experimental\nvalidations to prove that the proposed algorithm can achieve state-of-the-art\nresults among related methods. The comprehensive ablation study validates the\neffectiveness of each component of the proposed algorithm.\n","authors":["Shiyang Yan","Zongxuan Liu","Lin Xu"],"pdf_url":"https://arxiv.org/pdf/2310.08390v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08387v1","updated":"2023-10-12T14:59:22Z","published":"2023-10-12T14:59:22Z","title":"MeanAP-Guided Reinforced Active Learning for Object Detection","summary":" Active learning presents a promising avenue for training high-performance\nmodels with minimal labeled data, achieved by judiciously selecting the most\ninformative instances to label and incorporating them into the task learner.\nDespite notable advancements in active learning for image recognition, metrics\ndevised or learned to gauge the information gain of data, crucial for query\nstrategy design, do not consistently align with task model performance metrics,\nsuch as Mean Average Precision (MeanAP) in object detection tasks. This paper\nintroduces MeanAP-Guided Reinforced Active Learning for Object Detection\n(MAGRAL), a novel approach that directly utilizes the MeanAP metric of the task\nmodel to devise a sampling strategy employing a reinforcement learning-based\nsampling agent. Built upon LSTM architecture, the agent efficiently explores\nand selects subsequent training instances, and optimizes the process through\npolicy gradient with MeanAP serving as reward. Recognizing the time-intensive\nnature of MeanAP computation at each step, we propose fast look-up tables to\nexpedite agent training. We assess MAGRAL's efficacy across popular benchmarks,\nPASCAL VOC and MS COCO, utilizing different backbone architectures. Empirical\nfindings substantiate MAGRAL's superiority over recent state-of-the-art\nmethods, showcasing substantial performance gains. MAGRAL establishes a robust\nbaseline for reinforced active object detection, signifying its potential in\nadvancing the field.\n","authors":["Zhixuan Liang","Xingyu Zeng","Rui Zhao","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2310.08387v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08381v1","updated":"2023-10-12T14:55:31Z","published":"2023-10-12T14:55:31Z","title":"AutoVP: An Automated Visual Prompting Framework and Benchmark","summary":" Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach\nto adapting pre-trained vision models to solve various downstream\nimage-classification tasks. However, there has hitherto been little systematic\nstudy of the design space of VP and no clear benchmark for evaluating its\nperformance. To bridge this gap, we propose AutoVP, an end-to-end expandable\nframework for automating VP design choices, along with 12 downstream\nimage-classification tasks that can serve as a holistic VP-performance\nbenchmark. 
Our design space covers 1) the joint optimization of the prompts; 2)\nthe selection of pre-trained models, including image classifiers and text-image\nencoders; and 3) model output mapping strategies, including nonparametric and\ntrainable label mapping. Our extensive experimental results show that AutoVP\noutperforms the best-known current VP methods by a substantial margin, having\nup to 6.7% improvement in accuracy; and attains a maximum performance increase\nof 27.5% compared to linear-probing (LP) baseline. AutoVP thus makes a two-fold\ncontribution: serving both as an efficient tool for hyperparameter tuning on VP\ndesign choices, and as a comprehensive benchmark that can reasonably be\nexpected to accelerate VP's development. The source code is available at\nhttps://github.com/IBM/AutoVP.\n","authors":["Hsi-Ai Tsao","Lei Hsiung","Pin-Yu Chen","Sijia Liu","Tsung-Yi Ho"],"pdf_url":"https://arxiv.org/pdf/2310.08381v1.pdf","comment":"Preprint. The code is available at https://github.com/IBM/AutoVP"},{"id":"http://arxiv.org/abs/2310.08371v1","updated":"2023-10-12T14:40:24Z","published":"2023-10-12T14:40:24Z","title":"Worst-Case Morphs using Wasserstein ALI and Improved MIPGAN","summary":" A lot of progress has been made in the last years on using Generative\nAdversarial Networks (GAN) to create realistic images. However, to be able\nreconstruct images or to generate images using real data as input, an Encoder\nis needed that reverses the mapping from the GAN's latent space to image space.\nThis means that three networks are needed: an Encoder, a Decoder (called\nGenerator in a normal GAN) and a Discriminator. These three networks can be\ntrained from scratch simultaneously (Adversarially Learned Inference), or\nalternatively an Encoder network can be trained that maps images into the\nlatent space of a \\textit{pretrained} GAN model (Inverse GAN). In the latter\ncase, the networks are trained consecutively, so the Encoder has to make do\nwith whatever model the Decoder learned during GAN training. Training three\nnetworks simultaneously is more unstable and therefore more challenging, but it\nis possible that the Encoder and Decoder benefit from interacting with each\nother during training. We compare the two different approaches and discuss\nwhether it is worth the extra effort to train all three networks\nsimultaneously.\n","authors":["Una M. Kelly","Meike Nauta","Lu Liu","Luuk J. Spreeuwers","Raymond N. J. Veldhuis"],"pdf_url":"https://arxiv.org/pdf/2310.08371v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08370v1","updated":"2023-10-12T14:39:58Z","published":"2023-10-12T14:39:58Z","title":"UniPAD: A Universal Pre-training Paradigm for Autonomous Driving","summary":" In the context of autonomous driving, the significance of effective feature\nlearning is widely acknowledged. While conventional 3D self-supervised\npre-training methods have shown widespread success, most methods follow the\nideas originally designed for 2D images. In this paper, we present UniPAD, a\nnovel self-supervised learning paradigm applying 3D volumetric differentiable\nrendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction\nof continuous 3D shape structures and the intricate appearance characteristics\nof their 2D projections. The flexibility of our method enables seamless\nintegration into both 2D and 3D frameworks, enabling a more holistic\ncomprehension of the scenes. 
We manifest the feasibility and effectiveness of\nUniPAD by conducting extensive experiments on various downstream 3D tasks. Our\nmethod significantly improves lidar-, camera-, and lidar-camera-based baseline\nby 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline\nachieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic\nsegmentation on the nuScenes validation set, achieving state-of-the-art results\nin comparison with previous methods. The code will be available at\nhttps://github.com/Nightmare-n/UniPAD.\n","authors":["Honghui Yang","Sha Zhang","Di Huang","Xiaoyang Wu","Haoyi Zhu","Tong He","Shixiang Tang","Hengshuang Zhao","Qibo Qiu","Binbin Lin","Xiaofei He","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.08370v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08368v1","updated":"2023-10-12T14:38:52Z","published":"2023-10-12T14:38:52Z","title":"Mapping Memes to Words for Multimodal Hateful Meme Classification","summary":" Multimodal image-text memes are prevalent on the internet, serving as a\nunique form of communication that combines visual and textual elements to\nconvey humor, ideas, or emotions. However, some memes take a malicious turn,\npromoting hateful content and perpetuating discrimination. Detecting hateful\nmemes within this multimodal context is a challenging task that requires\nunderstanding the intertwined meaning of text and images. In this work, we\naddress this issue by proposing a novel approach named ISSUES for multimodal\nhateful meme classification. ISSUES leverages a pre-trained CLIP\nvision-language model and the textual inversion technique to effectively\ncapture the multimodal semantic content of the memes. The experiments show that\nour method achieves state-of-the-art results on the Hateful Memes Challenge and\nHarMeme datasets. The code and the pre-trained models are publicly available at\nhttps://github.com/miccunifi/ISSUES.\n","authors":["Giovanni Burbi","Alberto Baldrati","Lorenzo Agnolucci","Marco Bertini","Alberto Del Bimbo"],"pdf_url":"https://arxiv.org/pdf/2310.08368v1.pdf","comment":"ICCV2023 CLVL Workshop"},{"id":"http://arxiv.org/abs/2310.08367v1","updated":"2023-10-12T14:38:25Z","published":"2023-10-12T14:38:25Z","title":"MCU: A Task-centric Framework for Open-ended Agent Evaluation in\n Minecraft","summary":" To pursue the goal of creating an open-ended agent in Minecraft, an\nopen-ended game environment with unlimited possibilities, this paper introduces\na task-centric framework named MCU for Minecraft agent evaluation. The MCU\nframework leverages the concept of atom tasks as fundamental building blocks,\nenabling the generation of diverse or even arbitrary tasks. Within the MCU\nframework, each task is measured with six distinct difficulty scores (time\nconsumption, operational effort, planning complexity, intricacy, creativity,\nnovelty). These scores offer a multi-dimensional assessment of a task from\ndifferent angles, and thus can reveal an agent's capability on specific facets.\nThe difficulty scores also serve as the feature of each task, which creates a\nmeaningful task space and unveils the relationship between tasks. For efficient\nevaluation of Minecraft agents employing the MCU framework, we maintain a\nunified benchmark, namely SkillForge, which comprises representative tasks with\ndiverse categories and difficulty distribution. We also provide convenient\nfilters for users to select tasks to assess specific capabilities of agents. 
We\nshow that MCU has the high expressivity to cover all tasks used in recent\nliterature on Minecraft agent, and underscores the need for advancements in\nareas such as creativity, precise control, and out-of-distribution\ngeneralization under the goal of open-ended Minecraft agent development.\n","authors":["Haowei Lin","Zihao Wang","Jianzhu Ma","Yitao Liang"],"pdf_url":"https://arxiv.org/pdf/2310.08367v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04099v2","updated":"2023-10-12T14:18:52Z","published":"2023-10-06T09:01:15Z","title":"ClusVPR: Efficient Visual Place Recognition with Clustering-based\n Weighted Transformer","summary":" Visual place recognition (VPR) is a highly challenging task that has a wide\nrange of applications, including robot navigation and self-driving vehicles.\nVPR is particularly difficult due to the presence of duplicate regions and the\nlack of attention to small objects in complex scenes, resulting in recognition\ndeviations. In this paper, we present ClusVPR, a novel approach that tackles\nthe specific issues of redundant information in duplicate regions and\nrepresentations of small objects. Different from existing methods that rely on\nConvolutional Neural Networks (CNNs) for feature map generation, ClusVPR\nintroduces a unique paradigm called Clustering-based Weighted Transformer\nNetwork (CWTNet). CWTNet leverages the power of clustering-based weighted\nfeature maps and integrates global dependencies to effectively address visual\ndeviations encountered in large-scale VPR problems. We also introduce the\noptimized-VLAD (OptLAD) layer that significantly reduces the number of\nparameters and enhances model efficiency. This layer is specifically designed\nto aggregate the information obtained from scale-wise image patches.\nAdditionally, our pyramid self-supervised strategy focuses on extracting\nrepresentative and diverse information from scale-wise image patches instead of\nentire images, which is crucial for capturing representative and diverse\ninformation in VPR. Extensive experiments on four VPR datasets show our model's\nsuperior performance compared to existing models while being less complex.\n","authors":["Yifan Xu","Pourya Shamsolmoali","Jie Yang"],"pdf_url":"https://arxiv.org/pdf/2310.04099v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08339v1","updated":"2023-10-12T13:57:32Z","published":"2023-10-12T13:57:32Z","title":"A Generic Software Framework for Distributed Topological Analysis\n Pipelines","summary":" This system paper presents a software framework for the support of\ntopological analysis pipelines in a distributed-memory model. While several\nrecent papers introduced topology-based approaches for distributed-memory\nenvironments, these were reporting experiments obtained with tailored,\nmono-algorithm implementations. In contrast, we describe in this paper a\ngeneral-purpose, generic framework for topological analysis pipelines, i.e. a\nsequence of topological algorithms interacting together, possibly on distinct\nnumbers of processes. Specifically, we instantiated our framework with the MPI\nmodel, within the Topology ToolKit (TTK). While developing this framework, we\nfaced several algorithmic and software engineering challenges, which we\ndocument in this paper. We provide a taxonomy for the distributed-memory\ntopological algorithms supported by TTK, depending on their communication needs\nand provide examples of hybrid MPI+thread parallelizations. 
Detailed\nperformance analyses show that parallel efficiencies range from $20\\%$ to\n$80\\%$ (depending on the algorithms), and that the MPI-specific preconditioning\nintroduced by our framework induces a negligible computation time overhead. We\nillustrate the new distributed-memory capabilities of TTK with an example of\nadvanced analysis pipeline, combining multiple algorithms, run on the largest\npublicly available dataset we have found (120 billion vertices) on a standard\ncluster with 64 nodes (for a total of 1,536 cores). Finally, we provide a\nroadmap for the completion of TTK's MPI extension, along with generic\nrecommendations for each algorithm communication category.\n","authors":["Eve Le Guillou","Michael Will","Pierre Guillou","Jonas Lukasczyk","Pierre Fortin","Christoph Garth","Julien Tierny"],"pdf_url":"https://arxiv.org/pdf/2310.08339v1.pdf","comment":"18 pages, 12 figures"},{"id":"http://arxiv.org/abs/2206.02136v3","updated":"2023-10-12T13:55:06Z","published":"2022-06-05T09:39:12Z","title":"LDRNet: Enabling Real-time Document Localization on Mobile Devices","summary":" While Identity Document Verification (IDV) technology on mobile devices\nbecomes ubiquitous in modern business operations, the risk of identity theft\nand fraud is increasing. The identity document holder is normally required to\nparticipate in an online video interview to circumvent impostors. However, the\ncurrent IDV process depends on an additional human workforce to support online\nstep-by-step guidance which is inefficient and expensive. The performance of\nexisting AI-based approaches cannot meet the real-time and lightweight demands\nof mobile devices. In this paper, we address those challenges by designing an\nedge intelligence-assisted approach for real-time IDV. Aiming at improving the\nresponsiveness of the IDV process, we propose a new document localization model\nfor mobile devices, LDRNet, to Localize the identity Document in Real-time. On\nthe basis of a lightweight backbone network, we build three prediction branches\nfor LDRNet, the corner points prediction, the line borders prediction and the\ndocument classification. We design novel supplementary targets, the\nequal-division points, and use a new loss function named Line Loss, to improve\nthe speed and accuracy of our approach. In addition to the IDV process, LDRNet\nis an efficient and reliable document localization alternative for all kinds of\nmobile applications. As a matter of proof, we compare the performance of LDRNet\nwith other popular approaches on localizing general document datasets. The\nexperimental results show that LDRNet runs at a speed up to 790 FPS which is\n47x faster, while still achieving comparable Jaccard Index(JI) in single-model\nand single-scale tests.\n","authors":["Han Wu","Holland Qian","Huaming Wu","Aad van Moorsel"],"pdf_url":"https://arxiv.org/pdf/2206.02136v3.pdf","comment":"ECML-PKDD 2022 https://doi.org/10.1007/978-3-031-23618-1_42"},{"id":"http://arxiv.org/abs/2310.08332v1","updated":"2023-10-12T13:46:36Z","published":"2023-10-12T13:46:36Z","title":"Real-Time Neural BRDF with Spherically Distributed Primitives","summary":" We propose a novel compact and efficient neural BRDF offering highly\nversatile material representation, yet with very-light memory and neural\ncomputation consumption towards achieving real-time rendering. 
The results in\nFigure 1, rendered at full HD resolution on a current desktop machine, show\nthat our system achieves real-time rendering with a wide variety of\nappearances, which is approached by the following two designs. On the one hand,\nnoting that bidirectional reflectance is distributed in a very sparse\nhigh-dimensional subspace, we propose to project the BRDF into two\nlow-dimensional components, i.e., two hemisphere feature-grids for incoming and\noutgoing directions, respectively. On the other hand, learnable neural\nreflectance primitives are distributed on our highly-tailored spherical surface\ngrid, which offer informative features for each component and alleviate the\nconventional heavy feature learning network to a much smaller one, leading to\nvery fast evaluation. These primitives are centrally stored in a codebook and\ncan be shared across multiple grids and even across materials, based on the\nlow-cost indices stored in material-specific spherical surface grids. Our\nneural BRDF, which is agnostic to the material, provides a unified framework\nthat can represent a variety of materials in consistent manner. Comprehensive\nexperimental results on measured BRDF compression, Monte Carlo simulated BRDF\nacceleration, and extension to spatially varying effect demonstrate the\nsuperior quality and generalizability achieved by the proposed scheme.\n","authors":["Yishun Dou","Zhong Zheng","Qiaoqiao Jin","Bingbing Ni","Yugang Chen","Junxiang Ke"],"pdf_url":"https://arxiv.org/pdf/2310.08332v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08326v1","updated":"2023-10-12T13:42:49Z","published":"2023-10-12T13:42:49Z","title":"NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence\n Understanding","summary":" Understanding 4D point cloud sequences online is of significant practical\nvalue in various scenarios such as VR/AR, robotics, and autonomous driving. The\nkey goal is to continuously analyze the geometry and dynamics of a 3D scene as\nunstructured and redundant point cloud sequences arrive. And the main challenge\nis to effectively model the long-term history while keeping computational costs\nmanageable. To tackle these challenges, we introduce a generic online 4D\nperception paradigm called NSM4D. NSM4D serves as a plug-and-play strategy that\ncan be adapted to existing 4D backbones, significantly enhancing their online\nperception capabilities for both indoor and outdoor scenarios. To efficiently\ncapture the redundant 4D history, we propose a neural scene model that\nfactorizes geometry and motion information by constructing geometry tokens\nseparately storing geometry and motion features. Exploiting the history becomes\nas straightforward as querying the neural scene model. As the sequence\nprogresses, the neural scene model dynamically deforms to align with new\nobservations, effectively providing the historical context and updating itself\nwith the new observations. By employing token representation, NSM4D also\nexhibits robustness to low-level sensor noise and maintains a compact size\nthrough a geometric sampling scheme. We integrate NSM4D with state-of-the-art\n4D perception backbones, demonstrating significant improvements on various\nonline perception benchmarks in indoor and outdoor settings. 
Notably, we\nachieve a 9.6% accuracy improvement for HOI4D online action segmentation and a\n3.4% mIoU improvement for SemanticKITTI online semantic segmentation.\nFurthermore, we show that NSM4D inherently offers excellent scalability to\nlonger sequences beyond the training set, which is crucial for real-world\napplications.\n","authors":["Yuhao Dong","Zhuoyang Zhang","Yunze Liu","Li Yi"],"pdf_url":"https://arxiv.org/pdf/2310.08326v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08320v1","updated":"2023-10-12T13:33:04Z","published":"2023-10-12T13:33:04Z","title":"Defending Our Privacy With Backdoors","summary":" The proliferation of large AI models trained on uncurated, often sensitive\nweb-scraped data has raised significant privacy concerns. One of the concerns\nis that adversaries can extract information about the training data using\nprivacy attacks. Unfortunately, the task of removing specific information from\nthe models without sacrificing performance is not straightforward and has\nproven to be challenging. We propose a rather easy yet effective defense based\non backdoor attacks to remove private information such as names of individuals\nfrom models, and focus in this work on text encoders. Specifically, through\nstrategic insertion of backdoors, we align the embeddings of sensitive phrases\nwith those of neutral terms-\"a person\" instead of the person's name. Our\nempirical results demonstrate the effectiveness of our backdoor-based defense\non CLIP by assessing its performance using a specialized privacy attack for\nzero-shot classifiers. Our approach provides not only a new \"dual-use\"\nperspective on backdoor attacks, but also presents a promising avenue to\nenhance the privacy of individuals within models trained on uncurated\nweb-scraped data.\n","authors":["Dominik Hintersdorf","Lukas Struppek","Daniel Neider","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2310.08320v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.08316v1","updated":"2023-10-12T13:27:21Z","published":"2023-10-12T13:27:21Z","title":"Extended target tracking utilizing machine-learning software -- with\n applications to animal classification","summary":" This paper considers the problem of detecting and tracking objects in a\nsequence of images. The problem is formulated in a filtering framework, using\nthe output of object-detection algorithms as measurements. An extension to the\nfiltering formulation is proposed that incorporates class information from the\nprevious frame to robustify the classification, even if the object-detection\nalgorithm outputs an incorrect prediction. Further, the properties of the\nobject-detection algorithm are exploited to quantify the uncertainty of the\nbounding box detection in each frame. The complete filtering method is\nevaluated on camera trap images of the four large Swedish carnivores, bear,\nlynx, wolf, and wolverine. The experiments show that the class tracking\nformulation leads to a more robust classification.\n","authors":["Magnus Malmström","Anton Kullberg","Isaac Skog","Daniel Axehill","Fredrik Gustafsson"],"pdf_url":"https://arxiv.org/pdf/2310.08316v1.pdf","comment":"5 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.08312v1","updated":"2023-10-12T13:20:17Z","published":"2023-10-12T13:20:17Z","title":"GePSAn: Generative Procedure Step Anticipation in Cooking Videos","summary":" We study the problem of future step anticipation in procedural videos. 
Given\na video of an ongoing procedural activity, we predict a plausible next\nprocedure step described in rich natural language. While most previous work\nfocus on the problem of data scarcity in procedural video datasets, another\ncore challenge of future anticipation is how to account for multiple plausible\nfuture realizations in natural settings. This problem has been largely\noverlooked in previous work. To address this challenge, we frame future step\nprediction as modelling the distribution of all possible candidates for the\nnext step. Specifically, we design a generative model that takes a series of\nvideo clips as input, and generates multiple plausible and diverse candidates\n(in natural language) for the next step. Following previous work, we side-step\nthe video annotation scarcity by pretraining our model on a large text-based\ncorpus of procedural activities, and then transfer the model to the video\ndomain. Our experiments, both in textual and video domains, show that our model\ncaptures diversity in the next step prediction and generates multiple plausible\nfuture predictions. Moreover, our model establishes new state-of-the-art\nresults on YouCookII, where it outperforms existing baselines on the next step\nanticipation. Finally, we also show that our model can successfully transfer\nfrom text to the video domain zero-shot, ie, without fine-tuning or adaptation,\nand produces good-quality future step predictions from video.\n","authors":["Mohamed Ashraf Abdelsalam","Samrudhdhi B. Rangrej","Isma Hadji","Nikita Dvornik","Konstantinos G. Derpanis","Afsaneh Fazly"],"pdf_url":"https://arxiv.org/pdf/2310.08312v1.pdf","comment":"published at ICCV 2023"},{"id":"http://arxiv.org/abs/2310.08304v1","updated":"2023-10-12T13:11:38Z","published":"2023-10-12T13:11:38Z","title":"CHIP: Contrastive Hierarchical Image Pretraining","summary":" Few-shot object classification is the task of classifying objects in an image\nwith limited number of examples as supervision. We propose a one-shot/few-shot\nclassification model that can classify an object of any unseen class into a\nrelatively general category in an hierarchically based classification. Our\nmodel uses a three-level hierarchical contrastive loss based ResNet152\nclassifier for classifying an object based on its features extracted from Image\nembedding, not used during the training phase. For our experimentation, we have\nused a subset of the ImageNet (ILSVRC-12) dataset that contains only the animal\nclasses for training our model and created our own dataset of unseen classes\nfor evaluating our trained model. Our model provides satisfactory results in\nclassifying the unknown objects into a generic category which has been later\ndiscussed in greater detail.\n","authors":["Arpit Mittal","Harshil Jhaveri","Swapnil Mallick","Abhishek Ajmera"],"pdf_url":"https://arxiv.org/pdf/2310.08304v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08303v1","updated":"2023-10-12T13:09:40Z","published":"2023-10-12T13:09:40Z","title":"Multimodal Variational Auto-encoder based Audio-Visual Segmentation","summary":" We propose an Explicit Conditional Multimodal Variational Auto-Encoder\n(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources\nin the video sequence. Existing AVS methods focus on implicit feature fusion\nstrategies, where models are trained to fit the discrete samples in the\ndataset. With a limited and less diverse dataset, the resulting performance is\nusually unsatisfactory. 
In contrast, we address this problem from an effective\nrepresentation learning perspective, aiming to model the contribution of each\nmodality explicitly. Specifically, we find that audio contains critical\ncategory information of the sound producers, and visual data provides candidate\nsound producer(s). Their shared information corresponds to the target sound\nproducer(s) shown in the visual data. In this case, cross-modal shared\nrepresentation learning is especially important for AVS. To achieve this, our\nECMVAE factorizes the representations of each modality with a modality-shared\nrepresentation and a modality-specific representation. An orthogonality\nconstraint is applied between the shared and specific representations to\nmaintain the exclusive attribute of the factorized latent code. Further, a\nmutual information maximization regularizer is introduced to achieve extensive\nexploration of each modality. Quantitative and qualitative evaluations on the\nAVSBench demonstrate the effectiveness of our approach, leading to a new\nstate-of-the-art for AVS, with a 3.84 mIOU performance leap on the challenging\nMS3 subset for multiple sound source segmentation.\n","authors":["Yuxin Mao","Jing Zhang","Mochu Xiang","Yiran Zhong","Yuchao Dai"],"pdf_url":"https://arxiv.org/pdf/2310.08303v1.pdf","comment":"Accepted by ICCV2023,Project\n page(https://npucvr.github.io/MMVAE-AVS),Code(https://github.com/OpenNLPLab/MMVAE-AVS)"},{"id":"http://arxiv.org/abs/2308.12435v2","updated":"2023-10-12T12:57:55Z","published":"2023-08-23T21:36:35Z","title":"Characterising representation dynamics in recurrent neural networks for\n object recognition","summary":" Recurrent neural networks (RNNs) have yielded promising results for both\nrecognizing objects in challenging conditions and modeling aspects of primate\nvision. However, the representational dynamics of recurrent computations remain\npoorly understood, especially in large-scale visual models. Here, we studied\nsuch dynamics in RNNs trained for object classification on MiniEcoset, a novel\nsubset of ecoset. We report two main insights. First, upon inference,\nrepresentations continued to evolve after correct classification, suggesting a\nlack of the notion of being ``done with classification''. Second, focusing on\n``readout zones'' as a way to characterize the activation trajectories, we\nobserve that misclassified representations exhibit activation patterns with\nlower L2 norm, and are positioned more peripherally in the readout zones. Such\narrangements help the misclassified representations move into the correct zones\nas time progresses. Our findings generalize to networks with lateral and\ntop-down connections, and include both additive and multiplicative interactions\nwith the bottom-up sweep. The results therefore contribute to a general\nunderstanding of RNN dynamics in naturalistic tasks. We hope that the analysis\nframework will aid future investigations of other types of RNNs, including\nunderstanding of representational dynamics in primate vision.\n","authors":["Sushrut Thorat","Adrien Doerig","Tim C. 
Kietzmann"],"pdf_url":"https://arxiv.org/pdf/2308.12435v2.pdf","comment":"8 pages, 7 figures; revision of our Conference on Cognitive\n Computational Neuroscience (CCN) 2023 paper"},{"id":"http://arxiv.org/abs/2202.09348v2","updated":"2023-10-12T12:56:28Z","published":"2022-02-18T18:36:01Z","title":"A Machine Learning Paradigm for Studying Pictorial Realism: Are\n Constable's Clouds More Real than His Contemporaries?","summary":" The British landscape painter John Constable is considered foundational for\nthe Realist movement in 19th-century European painting. Constable's painted\nskies, in particular, were seen as remarkably accurate by his contemporaries,\nan impression shared by many viewers today. Yet, assessing the accuracy of\nrealist paintings like Constable's is subjective or intuitive, even for\nprofessional art historians, making it difficult to say with certainty what set\nConstable's skies apart from those of his contemporaries. Our goal is to\ncontribute to a more objective understanding of Constable's realism. We propose\na new machine-learning-based paradigm for studying pictorial realism in an\nexplainable way. Our framework assesses realism by measuring the similarity\nbetween clouds painted by artists noted for their skies, like Constable, and\nphotographs of clouds. The experimental results of cloud classification show\nthat Constable approximates more consistently than his contemporaries the\nformal features of actual clouds in his paintings. The study, as a novel\ninterdisciplinary approach that combines computer vision and machine learning,\nmeteorology, and art history, is a springboard for broader and deeper analyses\nof pictorial realism.\n","authors":["Zhuomin Zhang","Elizabeth C. Mansfield","Jia Li","John Russell","George S. Young","Catherine Adams","James Z. Wang"],"pdf_url":"https://arxiv.org/pdf/2202.09348v2.pdf","comment":"Supplementary materials are available from the authors or\n http://wang.ist.psu.edu"},{"id":"http://arxiv.org/abs/2303.07189v3","updated":"2023-10-12T12:47:22Z","published":"2023-03-13T15:30:28Z","title":"Optimizing Convolutional Neural Networks for Chronic Obstructive\n Pulmonary Disease Detection in Clinical Computed Tomography Imaging","summary":" We aim to optimize the binary detection of Chronic Obstructive Pulmonary\nDisease (COPD) based on emphysema presence in the lung with convolutional\nneural networks (CNN) by exploring manually adjusted versus automated\nwindow-setting optimization (WSO) on computed tomography (CT) images. 7,194 CT\nimages (3,597 with COPD; 3,597 healthy controls) from 78 subjects (43 with\nCOPD; 35 healthy controls) were selected retrospectively (10.2018-12.2019) and\npreprocessed. For each image, intensity values were manually clipped to the\nemphysema window setting and a baseline 'full-range' window setting.\nClass-balanced train, validation, and test sets contained 3,392, 1,114, and\n2,688 images. The network backbone was optimized by comparing various CNN\narchitectures. Furthermore, automated WSO was implemented by adding a\ncustomized layer to the model. The image-level area under the Receiver\nOperating Characteristics curve (AUC) [lower, upper limit 95% confidence] was\nutilized to compare model variations. Repeated inference (n=7) on the test set\nshowed that the DenseNet was the most efficient backbone and achieved a mean\nAUC of 0.80 [0.76, 0.85] without WSO. 
Comparably, with input images manually\nadjusted to the emphysema window, the DenseNet model predicted COPD with a mean\nAUC of 0.86 [0.82, 0.89]. By adding a customized WSO layer to the DenseNet, an\noptimal window in the proximity of the emphysema window setting was learned\nautomatically, and a mean AUC of 0.82 [0.78, 0.86] was achieved. Detection of\nCOPD with DenseNet models was improved by WSO of CT data to the emphysema\nwindow setting range.\n","authors":["Tina Dorosti","Manuel Schultheiss","Felix Hofmann","Johannes Thalhammer","Luisa Kirchner","Theresa Urban","Franz Pfeiffer","Florian Schaff","Tobias Lasser","Daniela Pfeiffer"],"pdf_url":"https://arxiv.org/pdf/2303.07189v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.01601v3","updated":"2023-10-12T12:43:39Z","published":"2023-04-04T07:43:56Z","title":"Primitive Simultaneous Optimization of Similarity Metrics for Image\n Registration","summary":" Even though simultaneous optimization of similarity metrics is a standard\nprocedure in the field of semantic segmentation, surprisingly, this is much\nless established for image registration. To help closing this gap in the\nliterature, we investigate in a complex multi-modal 3D setting whether\nsimultaneous optimization of registration metrics, here implemented by means of\nprimitive summation, can benefit image registration. We evaluate two\nchallenging datasets containing collections of pre- to post-operative and pre-\nto intra-operative MR images of glioma. Employing the proposed optimization, we\ndemonstrate improved registration accuracy in terms of TRE on expert\nneuroradiologists' landmark annotations.\n","authors":["Diana Waldmannstetter","Benedikt Wiestler","Julian Schwarting","Ivan Ezhov","Marie Metz","Spyridon Bakas","Bhakti Baheti","Satrajit Chakrabarty","Daniel Rueckert","Jan S. Kirschke","Rolf A. Heckemann","Marie Piraud","Bjoern H. Menze","Florian Kofler"],"pdf_url":"https://arxiv.org/pdf/2304.01601v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08276v1","updated":"2023-10-12T12:28:47Z","published":"2023-10-12T12:28:47Z","title":"Direction-Oriented Visual-semantic Embedding Model for Remote Sensing\n Image-text Retrieval","summary":" Image-text retrieval has developed rapidly in recent years. However, it is\nstill a challenge in remote sensing due to visual-semantic imbalance, which\nleads to incorrect matching of non-semantic visual and textual features. To\nsolve this problem, we propose a novel Direction-Oriented Visual-semantic\nEmbedding Model (DOVE) to mine the relationship between vision and language.\nConcretely, a Regional-Oriented Attention Module (ROAM) adaptively adjusts the\ndistance between the final visual and textual embeddings in the latent semantic\nspace, oriented by regional visual features. Meanwhile, a lightweight Digging\nText Genome Assistant (DTGA) is designed to expand the range of tractable\ntextual representation and enhance global word-level semantic connections using\nless attention operations. Ultimately, we exploit a global visual-semantic\nconstraint to reduce single visual dependency and serve as an external\nconstraint for the final visual and textual representations. 
The effectiveness\nand superiority of our method are verified by extensive experiments including\nparameter evaluation, quantitative comparison, ablation studies and visual\nanalysis, on two benchmark datasets, RSICD and RSITMD.\n","authors":["Qing Ma","Jiancheng Pan","Cong Bai"],"pdf_url":"https://arxiv.org/pdf/2310.08276v1.pdf","comment":"13 pages, 11 figures"},{"id":"http://arxiv.org/abs/2310.08261v1","updated":"2023-10-12T12:06:31Z","published":"2023-10-12T12:06:31Z","title":"GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for\n Multi-Modal 3D Object Detection","summary":" LiDAR and cameras are complementary sensors for 3D object detection in\nautonomous driving. However, it is challenging to explore the unnatural\ninteraction between point clouds and images, and the critical factor is how to\nconduct feature alignment of heterogeneous modalities. Currently, many methods\nachieve feature alignment by projection calibration only, without considering\nthe problem of coordinate conversion accuracy errors between sensors, leading\nto sub-optimal performance. In this paper, we present GraphAlign, a more\naccurate feature alignment strategy for 3D object detection by graph matching.\nSpecifically, we fuse image features from a semantic segmentation encoder in\nthe image branch and point cloud features from a 3D Sparse CNN in the LiDAR\nbranch. To save computation, we construct the nearest neighbor relationship by\ncalculating Euclidean distance within the subspaces that are divided into the\npoint cloud features. Through the projection calibration between the image and\npoint cloud, we project the nearest neighbors of point cloud features onto the\nimage features. Then by matching the nearest neighbors with a single point\ncloud to multiple images, we search for a more appropriate feature alignment.\nIn addition, we provide a self-attention module to enhance the weights of\nsignificant relations to fine-tune the feature alignment between heterogeneous\nmodalities. Extensive experiments on nuScenes benchmark demonstrate the\neffectiveness and efficiency of our GraphAlign.\n","authors":["Ziying Song","Haiyue Wei","Lin Bai","Lei Yang","Caiyan Jia"],"pdf_url":"https://arxiv.org/pdf/2310.08261v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08259v1","updated":"2023-10-12T12:05:51Z","published":"2023-10-12T12:05:51Z","title":"Invisible Threats: Backdoor Attack in OCR Systems","summary":" Optical Character Recognition (OCR) is a widely used tool to extract text\nfrom scanned documents. Today, the state-of-the-art is achieved by exploiting\ndeep neural networks. However, the cost of this performance is paid at the\nprice of system vulnerability. For instance, in backdoor attacks, attackers\ncompromise the training phase by inserting a backdoor in the victim's model\nthat will be activated at testing time by specific patterns while leaving the\noverall model performance intact. This work proposes a backdoor attack for OCR\nresulting in the injection of non-readable characters from malicious input\nimages. 
This simple but effective attack exposes the state-of-the-art OCR\nweakness, making the extracted text correct to human eyes but simultaneously\nunusable for the NLP application that uses OCR as a preprocessing step.\nExperimental results show that the attacked models successfully output\nnon-readable characters for around 90% of the poisoned instances without\nharming their performance for the remaining instances.\n","authors":["Mauro Conti","Nicola Farronato","Stefanos Koffas","Luca Pajola","Stjepan Picek"],"pdf_url":"https://arxiv.org/pdf/2310.08259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04825v2","updated":"2023-10-12T12:05:15Z","published":"2023-10-07T14:29:57Z","title":"Comparative study of multi-person tracking methods","summary":" This paper presents a study of two tracking algorithms (SORT~\\cite{7533003}\nand Tracktor++~\\cite{2019}) that were ranked first positions on the MOT\nChallenge leaderboard (The MOTChallenge web page: https://motchallenge.net ).\nThe purpose of this study is to discover the techniques used and to provide\nuseful insights about these algorithms in the tracking pipeline that could\nimprove the performance of MOT tracking algorithms. To this end, we adopted the\npopular tracking-by-detection approach. We trained our own Pedestrian Detection\nmodel using the MOT17Det dataset (MOT17Det :\nhttps://motchallenge.net/data/MOT17Det/ ). We also used a re-identification\nmodel trained on MOT17 dataset (MOT17 : https://motchallenge.net/data/MOT17/ )\nfor Tracktor++ to reduce the false re-identification alarms. We then present\nexperimental results which shows that Tracktor++ is a better multi-person\ntracking algorithm than SORT. We also performed ablation studies to discover\nthe contribution of re-identification(RE-ID) network and motion to the results\nof Tracktor++. We finally conclude by providing some recommendations for future\nresearch.\n","authors":["Denis Mbey Akola"],"pdf_url":"https://arxiv.org/pdf/2310.04825v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04829v2","updated":"2023-10-12T12:04:16Z","published":"2023-10-07T14:38:16Z","title":"How to effectively train an ensemble of Faster R-CNN object detectors to\n quantify uncertainty","summary":" This paper presents a new approach for training two-stage object detection\nensemble models, more specifically, Faster R-CNN models to estimate\nuncertainty. We propose training one Region Proposal\nNetwork(RPN)~\\cite{https://doi.org/10.48550/arxiv.1506.01497} and multiple Fast\nR-CNN prediction heads is all you need to build a robust deep ensemble network\nfor estimating uncertainty in object detection. We present this approach and\nprovide experiments to show that this approach is much faster than the naive\nmethod of fully training all $n$ models in an ensemble. We also estimate the\nuncertainty by measuring this ensemble model's Expected Calibration Error\n(ECE). We then further compare the performance of this model with that of\nGaussian YOLOv3, a variant of YOLOv3 that models uncertainty using predicted\nbounding box coordinates. 
The source code is released at\n\\url{https://github.com/Akola-Mbey-Denis/EfficientEnsemble}\n","authors":["Denis Mbey Akola","Gianni Franchi"],"pdf_url":"https://arxiv.org/pdf/2310.04829v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19443v2","updated":"2023-10-12T12:02:34Z","published":"2023-05-30T22:34:48Z","title":"OWAdapt: An adaptive loss function for deep learning using OWA operators","summary":" In this paper, we propose a fuzzy adaptive loss function for enhancing deep\nlearning performance in classification tasks. Specifically, we redefine the\ncross-entropy loss to effectively address class-level noise conditions,\nincluding the challenging problem of class imbalance. Our approach introduces\naggregation operators, leveraging the power of fuzzy logic to improve\nclassification accuracy. The rationale behind our proposed method lies in the\niterative up-weighting of class-level components within the loss function,\nfocusing on those with larger errors. To achieve this, we employ the ordered\nweighted average (OWA) operator and combine it with an adaptive scheme for\ngradient-based learning. Through extensive experimentation, our method\noutperforms other commonly used loss functions, such as the standard\ncross-entropy or focal loss, across various binary and multiclass\nclassification tasks. Furthermore, we explore the influence of hyperparameters\nassociated with the OWA operators and present a default configuration that\nperforms well across different experimental settings.\n","authors":["Sebastián Maldonado","Carla Vairetti","Katherine Jara","Miguel Carrasco","Julio López"],"pdf_url":"https://arxiv.org/pdf/2305.19443v2.pdf","comment":"15 pages, 1 figure, published"},{"id":"http://arxiv.org/abs/2310.08255v1","updated":"2023-10-12T11:59:54Z","published":"2023-10-12T11:59:54Z","title":"Distilling from Vision-Language Models for Improved OOD Generalization\n in Vision Tasks","summary":" Vision-Language Models (VLMs) such as CLIP are trained on large amounts of\nimage-text pairs, resulting in remarkable generalization across several data\ndistributions. The prohibitively expensive training and data\ncollection/curation costs of these models make them valuable Intellectual\nProperty (IP) for organizations. This motivates a vendor-client paradigm, where\na vendor trains a large-scale VLM and grants only input-output access to\nclients on a pay-per-query basis in a black-box setting. The client aims to\nminimize inference cost by distilling the VLM to a student model using the\nlimited available task-specific data, and further deploying this student model\nin the downstream application. While naive distillation largely improves the\nIn-Domain (ID) accuracy of the student, it fails to transfer the superior\nout-of-distribution (OOD) generalization of the VLM teacher using the limited\navailable labeled images. To mitigate this, we propose Vision-Language to\nVision-Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and\nlanguage modalities of the teacher model with the vision modality of a\npre-trained student model, and further distills the aligned VLM embeddings to\nthe student. This maximally retains the pre-trained features of the student,\nwhile also incorporating the rich representations of the VLM image encoder and\nthe superior generalization of the text embeddings. 
The proposed approach\nachieves state-of-the-art results on the standard Domain Generalization\nbenchmarks in a black-box teacher setting, and also when weights of the VLM are\naccessible.\n","authors":["Sravanti Addepalli","Ashish Ramayee Asokan","Lakshay Sharma","R. Venkatesh Babu"],"pdf_url":"https://arxiv.org/pdf/2310.08255v1.pdf","comment":"Code is available at https://github.com/val-iisc/VL2V-ADiP.git"},{"id":"http://arxiv.org/abs/2309.06188v2","updated":"2023-10-12T11:51:21Z","published":"2023-09-12T12:54:12Z","title":"Computer Vision Pipeline for Automated Antarctic Krill Analysis","summary":" British Antarctic Survey (BAS) researchers launch annual expeditions to the\nAntarctic in order to estimate Antarctic Krill biomass and assess the change\nfrom previous years. These comparisons provide insight into the effects of the\ncurrent environment on this key component of the marine food chain. In this\nwork we have developed tools for automating the data collection and analysis\nprocess, using web-based image annotation tools and deep learning image\nclassification and regression models. We achieve highly accurate krill instance\nsegmentation results with an average 77.28% AP score, as well as separate\nmaturity stage and length estimation of krill specimens with 62.99% accuracy\nand a 1.98mm length error respectively.\n","authors":["Mazvydas Gudelis","Michal Mackiewicz","Julie Bremner","Sophie Fielding"],"pdf_url":"https://arxiv.org/pdf/2309.06188v2.pdf","comment":"Accepted to MVEO @ BMVC 2023"},{"id":"http://arxiv.org/abs/2308.03998v4","updated":"2023-10-12T11:49:34Z","published":"2023-08-08T02:28:48Z","title":"Real-time Strawberry Detection Based on Improved YOLOv5s Architecture\n for Robotic Harvesting in open-field environment","summary":" This study proposed a YOLOv5-based custom object detection model to detect\nstrawberries in an outdoor environment. The original architecture of the\nYOLOv5s was modified by replacing the C3 module with the C2f module in the\nbackbone network, which provided a better feature gradient flow. Secondly, the\nSpatial Pyramid Pooling Fast in the final layer of the backbone network of\nYOLOv5s was combined with Cross Stage Partial Net to improve the generalization\nability over the strawberry dataset in this study. The proposed architecture\nwas named YOLOv5s-Straw. The RGB images dataset of the strawberry canopy with\nthree maturity classes (immature, nearly mature, and mature) was collected in\nopen-field environment and augmented through a series of operations including\nbrightness reduction, brightness increase, and noise adding. To verify the\nsuperiority of the proposed method for strawberry detection in open-field\nenvironment, four competitive detection models (YOLOv3-tiny, YOLOv5s,\nYOLOv5s-C2f, and YOLOv8s) were trained, and tested under the same computational\nenvironment and compared with YOLOv5s-Straw. The results showed that the\nhighest mean average precision of 80.3% was achieved using the proposed\narchitecture whereas the same was achieved with YOLOv3-tiny, YOLOv5s,\nYOLOv5s-C2f, and YOLOv8s were 73.4%, 77.8%, 79.8%, 79.3%, respectively.\nSpecifically, the average precision of YOLOv5s-Straw was 82.1% in the immature\nclass, 73.5% in the nearly mature class, and 86.6% in the mature class, which\nwere 2.3% and 3.7%, respectively, higher than that of the latest YOLOv8s. 
The\nmodel included 8.6*10^6 network parameters with an inference speed of 18ms per\nimage while the inference speed of YOLOv8s had a slower inference speed of\n21.0ms and heavy parameters of 11.1*10^6, which indicates that the proposed\nmodel is fast enough for real time strawberry detection and localization for\nthe robotic picking.\n","authors":["Zixuan He","Salik Ram Khanal","Xin Zhang","Manoj Karkee","Qin Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03998v4.pdf","comment":"20 pages; 15 figures"},{"id":"http://arxiv.org/abs/2310.08230v1","updated":"2023-10-12T11:23:07Z","published":"2023-10-12T11:23:07Z","title":"Fast Discrete Optimisation for Geometrically Consistent 3D Shape\n Matching","summary":" In this work we propose to combine the advantages of learning-based and\ncombinatorial formalisms for 3D shape matching. While learning-based shape\nmatching solutions lead to state-of-the-art matching performance, they do not\nensure geometric consistency, so that obtained matchings are locally unsmooth.\nOn the contrary, axiomatic methods allow to take geometric consistency into\naccount by explicitly constraining the space of valid matchings. However,\nexisting axiomatic formalisms are impractical since they do not scale to\npractically relevant problem sizes, or they require user input for the\ninitialisation of non-convex optimisation problems. In this work we aim to\nclose this gap by proposing a novel combinatorial solver that combines a unique\nset of favourable properties: our approach is (i) initialisation free, (ii)\nmassively parallelisable powered by a quasi-Newton method, (iii) provides\noptimality gaps, and (iv) delivers decreased runtime and globally optimal\nresults for many instances.\n","authors":["Paul Roetzer","Ahmed Abbas","Dongliang Cao","Florian Bernard","Paul Swoboda"],"pdf_url":"https://arxiv.org/pdf/2310.08230v1.pdf","comment":"Paul Roetzer and Ahmed Abbas contributed equally"},{"id":"http://arxiv.org/abs/2310.08222v1","updated":"2023-10-12T11:14:27Z","published":"2023-10-12T11:14:27Z","title":"Structural analysis of Hindi online handwritten characters for character\n recognition","summary":" Direction properties of online strokes are used to analyze them in terms of\nhomogeneous regions or sub-strokes with points satisfying common geometric\nproperties. Such sub-strokes are called sub-units. These properties are used to\nextract sub-units from Hindi ideal online characters. These properties along\nwith some heuristics are used to extract sub-units from Hindi online\nhandwritten characters.\\\\ A method is developed to extract point stroke,\nclockwise curve stroke, counter-clockwise curve stroke and loop stroke segments\nas sub-units from Hindi online handwritten characters. These extracted\nsub-units are close in structure to the sub-units of the corresponding Hindi\nonline ideal characters.\\\\ Importance of local representation of online\nhandwritten characters in terms of sub-units is assessed by training a\nclassifier with sub-unit level local and character level global features\nextracted from characters for character recognition. The classifier has the\nrecognition accuracy of 93.5\\% on the testing set. This accuracy is the highest\nwhen compared with that of the classifiers trained only with global features\nextracted from characters in the same training set and evaluated on the same\ntesting set.\\\\ Sub-unit extraction algorithm and the sub-unit based character\nclassifier are tested on Hindi online handwritten character dataset. 
This\ndataset consists of samples from 96 different characters. There are 12832 and\n2821 samples in the training and testing sets, respectively.\n","authors":["Anand Sharma","A. G. Ramakrishnan"],"pdf_url":"https://arxiv.org/pdf/2310.08222v1.pdf","comment":"34 pages, 36 jpg figures"},{"id":"http://arxiv.org/abs/2309.00848v2","updated":"2023-10-12T11:11:23Z","published":"2023-09-02T07:17:43Z","title":"Bengali Document Layout Analysis -- A YOLOV8 Based Ensembling Approach","summary":" This paper focuses on enhancing Bengali Document Layout Analysis (DLA) using\nthe YOLOv8 model and innovative post-processing techniques. We tackle\nchallenges unique to the complex Bengali script by employing data augmentation\nfor model robustness. After meticulous validation set evaluation, we fine-tune\nour approach on the complete dataset, leading to a two-stage prediction\nstrategy for accurate element segmentation. Our ensemble model, combined with\npost-processing, outperforms individual base architectures, addressing issues\nidentified in the BaDLAD dataset. By leveraging this approach, we aim to\nadvance Bengali document analysis, contributing to improved OCR and document\ncomprehension and BaDLAD serves as a foundational resource for this endeavor,\naiding future research in the field. Furthermore, our experiments provided key\ninsights to incorporate new strategies into the established solution.\n","authors":["Nazmus Sakib Ahmed","Saad Sakib Noor","Ashraful Islam Shanto Sikder","Abhijit Paul"],"pdf_url":"https://arxiv.org/pdf/2309.00848v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08217v1","updated":"2023-10-12T11:05:34Z","published":"2023-10-12T11:05:34Z","title":"TriRE: A Multi-Mechanism Learning Paradigm for Continual Knowledge\n Retention and Promotion","summary":" Continual learning (CL) has remained a persistent challenge for deep neural\nnetworks due to catastrophic forgetting (CF) of previously learned tasks.\nSeveral techniques such as weight regularization, experience rehearsal, and\nparameter isolation have been proposed to alleviate CF. Despite their relative\nsuccess, these research directions have predominantly remained orthogonal and\nsuffer from several shortcomings, while missing out on the advantages of\ncompeting strategies. On the contrary, the brain continually learns,\naccommodates, and transfers knowledge across tasks by simultaneously leveraging\nseveral neurophysiological processes, including neurogenesis, active\nforgetting, neuromodulation, metaplasticity, experience rehearsal, and\ncontext-dependent gating, rarely resulting in CF. Inspired by how the brain\nexploits multiple mechanisms concurrently, we propose TriRE, a novel CL\nparadigm that encompasses retaining the most prominent neurons for each task,\nrevising and solidifying the extracted knowledge of current and past tasks, and\nactively promoting less active neurons for subsequent tasks through rewinding\nand relearning. 
Across CL settings, TriRE significantly reduces task\ninterference and surpasses different CL approaches considered in isolation.\n","authors":["Preetha Vijayan","Prashant Bhat","Elahe Arani","Bahram Zonooz"],"pdf_url":"https://arxiv.org/pdf/2310.08217v1.pdf","comment":"Accepted at 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.08206v1","updated":"2023-10-12T10:51:23Z","published":"2023-10-12T10:51:23Z","title":"Long-Tailed Classification Based on Coarse-Grained Leading Forest and\n Multi-Center Loss","summary":" Long-tailed(LT) classification is an unavoidable and challenging problem in\nthe real world. Most of the existing long-tailed classification methods focus\nonly on solving the inter-class imbalance in which there are more samples in\nthe head class than in the tail class, while ignoring the intra-lass imbalance\nin which the number of samples of the head attribute within the same class is\nmuch larger than the number of samples of the tail attribute. The deviation in\nthe model is caused by both of these factors, and due to the fact that\nattributes are implicit in most datasets and the combination of attributes is\nvery complex, the intra-class imbalance is more difficult to handle. For this\npurpose, we proposed a long-tailed classification framework, known as\n\\textbf{\\textsc{Cognisance}}, which is founded on Coarse-Grained Leading Forest\n(CLF) and Multi-Center Loss (MCL), aiming to build a multi-granularity joint\nsolution model by means of invariant feature learning. In this method, we\ndesigned an unsupervised learning method, i.e., CLF, to better characterize the\ndistribution of attributes within a class. Depending on the distribution of\nattributes, we can flexibly construct sampling strategies suitable for\ndifferent environments. In addition, we introduce a new metric learning loss\n(MCL), which aims to gradually eliminate confusing attributes during the\nfeature learning process. More importantly, this approach does not depend on a\nspecific model structure and can be integrated with existing LT methods as an\nindependent component. We have conducted extensive experiments and our approach\nhas state-of-the-art performance in both existing benchmarks ImageNet-GLT and\nMSCOCO-GLT, and can improve the performance of existing LT methods. Our codes\nare available on GitHub: \\url{https://github.com/jinyery/cognisance}\n","authors":["Jinye Yang","Ji Xu"],"pdf_url":"https://arxiv.org/pdf/2310.08206v1.pdf","comment":"This is another research work to apply leading tree structure along\n with deep learning architecture"},{"id":"http://arxiv.org/abs/2310.08204v1","updated":"2023-10-12T10:50:21Z","published":"2023-10-12T10:50:21Z","title":"Lifelong Audio-video Masked Autoencoder with Forget-robust Localized\n Alignments","summary":" We present a lifelong audio-video masked autoencoder that continually learns\nthe multimodal representations from a video stream containing audio-video\npairs, while its distribution continually shifts over time. Specifically, we\npropose two novel ideas to tackle the problem: (1) Localized Alignment: We\nintroduce a small trainable multimodal encoder that predicts the audio and\nvideo tokens that are well-aligned with each other. This allows the model to\nlearn only the highly correlated audiovisual patches with accurate multimodal\nrelationships. 
(2) Forget-robust multimodal patch selection: We compare the\nrelative importance of each audio-video patch between the current and past data\npair to mitigate unintended drift of the previously learned audio-video\nrepresentations. Our proposed method, FLAVA (Forget-robust Localized\nAudio-Video Alignment), therefore, captures the complex relationships between\nthe audio and video modalities during training on a sequence of pre-training\ntasks while alleviating the forgetting of learned audiovisual correlations. Our\nexperiments validate that FLAVA outperforms the state-of-the-art continual\nlearning methods on several benchmark datasets under continual audio-video\nrepresentation learning scenarios.\n","authors":["Jaewoo Lee","Jaehong Yoon","Wonjae Kim","Yunji Kim","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2310.08204v1.pdf","comment":"Preprint, project page: https://g-jwlee.github.io/FLAVA/"},{"id":"http://arxiv.org/abs/2310.05969v2","updated":"2023-10-12T10:26:17Z","published":"2023-09-28T07:57:03Z","title":"Automated Chest X-Ray Report Generator Using Multi-Model Deep Learning\n Approach","summary":" Reading and interpreting chest X-ray images is one of the most radiologist's\nroutines. However, it still can be challenging, even for the most experienced\nones. Therefore, we proposed a multi-model deep learning-based automated chest\nX-ray report generator system designed to assist radiologists in their work.\nThe basic idea of the proposed system is by utilizing multi\nbinary-classification models for detecting multi abnormalities, with each model\nresponsible for detecting one abnormality, in a single image. In this study, we\nlimited the radiology abnormalities detection to only cardiomegaly, lung\neffusion, and consolidation. The system generates a radiology report by\nperforming the following three steps: image pre-processing, utilizing deep\nlearning models to detect abnormalities, and producing a report. The aim of the\nimage pre-processing step is to standardize the input by scaling it to 128x128\npixels and slicing it into three segments, which covers the upper, lower, and\nmiddle parts of the lung. After pre-processing, each corresponding model\nclassifies the image, resulting in a 0 (zero) for no abnormality detected and a\n1 (one) for the presence of an abnormality. The prediction outputs of each\nmodel are then concatenated to form a 'result code'. The 'result code' is used\nto construct a report by selecting the appropriate pre-determined sentence for\neach detected abnormality in the report generation step. The proposed system is\nexpected to reduce the workload of radiologists and increase the accuracy of\nchest X-ray diagnosis.\n","authors":["Arief Purnama Muharram","Hollyana Puteri Haryono","Abassi Haji Juma","Ira Puspasari","Nugraha Priya Utama"],"pdf_url":"https://arxiv.org/pdf/2310.05969v2.pdf","comment":"Presented in the 2023 IEEE International Conference on Data and\n Software Engineering (ICoDSE 2023)"},{"id":"http://arxiv.org/abs/2309.13336v2","updated":"2023-10-12T10:24:42Z","published":"2023-09-23T10:58:08Z","title":"FedDrive v2: an Analysis of the Impact of Label Skewness in Federated\n Semantic Segmentation for Autonomous Driving","summary":" We propose FedDrive v2, an extension of the Federated Learning benchmark for\nSemantic Segmentation in Autonomous Driving. While the first version aims at\nstudying the effect of domain shift of the visual features across clients, in\nthis work, we focus on the distribution skewness of the labels. 
We propose six\nnew federated scenarios to investigate how label skewness affects the\nperformance of segmentation models and compare it with the effect of domain\nshift. Finally, we study the impact of using the domain information during\ntesting. Official website: https://feddrive.github.io\n","authors":["Eros Fanì","Marco Ciccone","Barbara Caputo"],"pdf_url":"https://arxiv.org/pdf/2309.13336v2.pdf","comment":"5th Italian Conference on Robotics and Intelligent Machines (I-RIM)\n 2023"},{"id":"http://arxiv.org/abs/2310.08182v1","updated":"2023-10-12T10:17:40Z","published":"2023-10-12T10:17:40Z","title":"XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness\n Evaluation","summary":" The lack of standardized robustness metrics and the widespread reliance on\nnumerous unrelated benchmark datasets for testing have created a gap between\nacademically validated robust models and their often problematic practical\nadoption. To address this, we introduce XIMAGENET-12, an explainable benchmark\ndataset with over 200K images and 15,600 manual semantic annotations. Covering\n12 categories from ImageNet to represent objects commonly encountered in\npractical life and simulating six diverse scenarios, including overexposure,\nblurring, color changing, etc., we further propose a novel robustness criterion\nthat extends beyond model generation ability assessment. This benchmark\ndataset, along with related code, is available at\nhttps://sites.google.com/view/ximagenet-12/home. Researchers and practitioners\ncan leverage this resource to evaluate the robustness of their visual models\nunder challenging conditions and ultimately benefit from the demands of\npractical computer vision systems.\n","authors":["Qiang Li","Dan Zhang","Shengzhao Lei","Xun Zhao","Shuyan Li","Porawit Kamnoedboon","WeiWei Li"],"pdf_url":"https://arxiv.org/pdf/2310.08182v1.pdf","comment":"UnderSubmission"},{"id":"http://arxiv.org/abs/2310.07449v2","updated":"2023-10-12T10:14:39Z","published":"2023-10-11T12:51:16Z","title":"PoRF: Pose Residual Field for Accurate Neural Surface Reconstruction","summary":" Neural surface reconstruction is sensitive to the camera pose noise, even if\nstate-of-the-art pose estimators like COLMAP or ARKit are used. More\nimportantly, existing Pose-NeRF joint optimisation methods have struggled to\nimprove pose accuracy in challenging real-world scenarios. To overcome the\nchallenges, we introduce the pose residual field (\\textbf{PoRF}), a novel\nimplicit representation that uses an MLP for regressing pose updates. This is\nmore robust than the conventional pose parameter optimisation due to parameter\nsharing that leverages global information over the entire sequence.\nFurthermore, we propose an epipolar geometry loss to enhance the supervision\nthat leverages the correspondences exported from COLMAP results without the\nextra computational overhead. Our method yields promising results. On the DTU\ndataset, we reduce the rotation error by 78\\% for COLMAP poses, leading to the\ndecreased reconstruction Chamfer distance from 3.48mm to 0.85mm. On the\nMobileBrick dataset that contains casually captured unbounded 360-degree\nvideos, our method refines ARKit poses and improves the reconstruction F1 score\nfrom 69.18 to 75.67, outperforming that with the dataset provided ground-truth\npose (75.14). 
These achievements demonstrate the efficacy of our approach in\nrefining camera poses and improving the accuracy of neural surface\nreconstruction in real-world scenarios.\n","authors":["Jia-Wang Bian","Wenjing Bian","Victor Adrian Prisacariu","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2310.07449v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.08177v1","updated":"2023-10-12T10:03:25Z","published":"2023-10-12T10:03:25Z","title":"Improving Fast Minimum-Norm Attacks with Hyperparameter Optimization","summary":" Evaluating the adversarial robustness of machine learning models using\ngradient-based attacks is challenging. In this work, we show that\nhyperparameter optimization can improve fast minimum-norm attacks by automating\nthe selection of the loss function, the optimizer and the step-size scheduler,\nalong with the corresponding hyperparameters. Our extensive evaluation\ninvolving several robust models demonstrates the improved efficacy of fast\nminimum-norm attacks when hyper-up with hyperparameter optimization. We release\nour open-source code at https://github.com/pralab/HO-FMN.\n","authors":["Giuseppe Floris","Raffaele Mura","Luca Scionis","Giorgio Piras","Maura Pintor","Ambra Demontis","Battista Biggio"],"pdf_url":"https://arxiv.org/pdf/2310.08177v1.pdf","comment":"Accepted at ESANN23"},{"id":"http://arxiv.org/abs/2310.08165v1","updated":"2023-10-12T09:37:56Z","published":"2023-10-12T09:37:56Z","title":"COVID-19 Detection Using Swin Transformer Approach from Computed\n Tomography Images","summary":" The accurate and efficient diagnosis of COVID-19 is of paramount importance,\nparticularly in the context of large-scale medical imaging datasets. In this\npreprint paper, we propose a novel approach for COVID-19 diagnosis using CT\nimages that leverages the power of Swin Transformer models, state-of-the-art\nsolutions in computer vision tasks. Our method includes a systematic approach\nfor patient-level predictions, where individual CT slices are classified as\nCOVID-19 or non-COVID, and the patient's overall diagnosis is determined\nthrough majority voting. The application of the Swin Transformer in this\ncontext results in patient-level predictions that demonstrate exceptional\ndiagnostic accuracy. In terms of evaluation metrics, our approach consistently\noutperforms the baseline, as well as numerous competing methods, showcasing its\neffectiveness in COVID-19 diagnosis. The macro F1 score achieved by our model\nexceeds the baseline and offers a robust solution for accurate diagnosis.\n","authors":["Kenan Morani"],"pdf_url":"https://arxiv.org/pdf/2310.08165v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.17046v2","updated":"2023-10-12T09:22:10Z","published":"2023-06-29T15:43:06Z","title":"Spiking Denoising Diffusion Probabilistic Models","summary":" Spiking neural networks (SNNs) have ultra-low energy consumption and high\nbiological plausibility due to their binary and bio-driven nature compared with\nartificial neural networks (ANNs). While previous research has primarily\nfocused on enhancing the performance of SNNs in classification tasks, the\ngenerative potential of SNNs remains relatively unexplored. In our paper, we\nput forward Spiking Denoising Diffusion Probabilistic Models (SDDPM), a new\nclass of SNN-based generative models that achieve high sample quality. 
To fully\nexploit the energy efficiency of SNNs, we propose a purely Spiking U-Net\narchitecture, which achieves comparable performance to its ANN counterpart\nusing only 4 time steps, resulting in significantly reduced energy consumption.\nExtensive experimental results reveal that our approach achieves\nstate-of-the-art on the generative tasks and substantially outperforms other\nSNN-based generative models, achieving up to $12\\times$ and $6\\times$\nimprovement on the CIFAR-10 and the CelebA datasets, respectively. Moreover, we\npropose a threshold-guided strategy that can further improve the performances\nby 16.7% in a training-free manner. The SDDPM symbolizes a significant\nadvancement in the field of SNN generation, injecting new perspectives and\npotential avenues of exploration.\n","authors":["Jiahang Cao","Ziqing Wang","Hanzhong Guo","Hao Cheng","Qiang Zhang","Renjing Xu"],"pdf_url":"https://arxiv.org/pdf/2306.17046v2.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2207.09339v3","updated":"2023-10-12T09:13:37Z","published":"2022-07-19T15:49:35Z","title":"Vision Transformers: From Semantic Segmentation to Dense Prediction","summary":" The emergence of vision transformers (ViTs) in image classification has\nshifted the methodologies for visual representation learning. In particular,\nViTs learn visual representation at full receptive field per layer across all\nthe image patches, in comparison to the increasing receptive fields of CNNs\nacross layers and other alternatives (e.g., large kernels and atrous\nconvolution). In this work, for the first time we explore the global context\nlearning potentials of ViTs for dense visual prediction (e.g., semantic\nsegmentation). Our motivation is that through learning global context at full\nreceptive field layer by layer, ViTs may capture stronger long-range dependency\ninformation, critical for dense prediction tasks. We first demonstrate that\nencoding an image as a sequence of patches, a vanilla ViT without local\nconvolution and resolution reduction can yield stronger visual representation\nfor semantic segmentation. For example, our model, termed as SEgmentation\nTRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position in the\ntest leaderboard on the day of submission) and Pascal Context (55.83% mIoU),\nand performs competitively on Cityscapes. For tackling general dense visual\nprediction tasks in a cost-effective manner, we further formulate a family of\nHierarchical Local-Global (HLG) Transformers, characterized by local attention\nwithin windows and global-attention across windows in a pyramidal architecture.\nExtensive experiments show that our methods achieve appealing performance on a\nvariety of dense prediction tasks (e.g., object detection and instance\nsegmentation and semantic segmentation) as well as image classification. Our\ncode and models are available at https://github.com/fudan-zvg/SETR.\n","authors":["Li Zhang","Jiachen Lu","Sixiao Zheng","Xinxuan Zhao","Xiatian Zhu","Yanwei Fu","Tao Xiang","Jianfeng Feng","Philip H. S. Torr"],"pdf_url":"https://arxiv.org/pdf/2207.09339v3.pdf","comment":"Extended version of CVPR 2021 paper arXiv:2012.15840"},{"id":"http://arxiv.org/abs/2301.12082v3","updated":"2023-10-12T09:01:04Z","published":"2023-01-28T03:58:32Z","title":"Pushing the Limits of Fewshot Anomaly Detection in Industry Vision:\n Graphcore","summary":" In the area of fewshot anomaly detection (FSAD), efficient visual feature\nplays an essential role in memory bank M-based methods. 
However, these methods\ndo not account for the relationship between the visual feature and its rotated\nvisual feature, drastically limiting the anomaly detection performance. To push\nthe limits, we reveal that rotation-invariant feature property has a\nsignificant impact in industrial-based FSAD. Specifically, we utilize graph\nrepresentation in FSAD and provide a novel visual isometric invariant feature\n(VIIF) as anomaly measurement feature. As a result, VIIF can robustly improve\nthe anomaly discriminating ability and can further reduce the size of redundant\nfeatures stored in M by a large amount. Besides, we provide a novel model\nGraphCore via VIIFs that can fast implement unsupervised FSAD training and can\nimprove the performance of anomaly detection. A comprehensive evaluation is\nprovided for comparing GraphCore and other SOTA anomaly detection models under\nour proposed fewshot anomaly detection setting, which shows GraphCore can\nincrease average AUC by 5.8%, 4.1%, 3.4%, and 1.6% on MVTec AD and by 25.5%,\n22.0%, 16.9%, and 14.1% on MPDD for 1, 2, 4, and 8-shot cases, respectively.\n","authors":["Guoyang Xie","Jinbao Wang","Jiaqi Liu","Feng Zheng","Yaochu Jin"],"pdf_url":"https://arxiv.org/pdf/2301.12082v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.02081v2","updated":"2023-10-12T09:00:34Z","published":"2022-10-05T08:19:16Z","title":"Locate before Answering: Answer Guided Question Localization for Video\n Question Answering","summary":" Video question answering (VideoQA) is an essential task in vision-language\nunderstanding, which has attracted numerous research attention recently.\nNevertheless, existing works mostly achieve promising performances on short\nvideos of duration within 15 seconds. For VideoQA on minute-level long-term\nvideos, those methods are likely to fail because of lacking the ability to deal\nwith noise and redundancy caused by scene changes and multiple actions in the\nvideo. Considering the fact that the question often remains concentrated in a\nshort temporal range, we propose to first locate the question to a segment in\nthe video and then infer the answer using the located segment only. Under this\nscheme, we propose \"Locate before Answering\" (LocAns), a novel approach that\nintegrates a question locator and an answer predictor into an end-to-end model.\nDuring the training phase, the available answer label not only serves as the\nsupervision signal of the answer predictor, but also is used to generate pseudo\ntemporal labels for the question locator. Moreover, we design a decoupled\nalternative training strategy to update the two modules separately. In the\nexperiments, LocAns achieves state-of-the-art performance on two modern\nlong-term VideoQA datasets NExT-QA and ActivityNet-QA, and its qualitative\nexamples show the reliable performance of the question localization.\n","authors":["Tianwen Qian","Ran Cui","Jingjing Chen","Pai Peng","Xiaowei Guo","Yu-Gang Jiang"],"pdf_url":"https://arxiv.org/pdf/2210.02081v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08143v1","updated":"2023-10-12T08:58:01Z","published":"2023-10-12T08:58:01Z","title":"A Deep Learning Framework for Spatiotemporal Ultrasound Localization\n Microscopy","summary":" Ultrasound Localization Microscopy can resolve the microvascular bed down to\na few micrometers. To achieve such performance microbubble contrast agents must\nperfuse the entire microvascular network. 
Microbubbles are then located\nindividually and tracked over time to sample individual vessels, typically over\nhundreds of thousands of images. To overcome the fundamental limit of\ndiffraction and achieve a dense reconstruction of the network, low microbubble\nconcentrations must be used, which lead to acquisitions lasting several\nminutes. Conventional processing pipelines are currently unable to deal with\ninterference from multiple nearby microbubbles, further reducing achievable\nconcentrations. This work overcomes this problem by proposing a Deep Learning\napproach to recover dense vascular networks from ultrasound acquisitions with\nhigh microbubble concentrations. A realistic mouse brain microvascular network,\nsegmented from 2-photon microscopy, was used to train a three-dimensional\nconvolutional neural network based on a V-net architecture. Ultrasound data\nsets from multiple microbubbles flowing through the microvascular network were\nsimulated and used as ground truth to train the 3D CNN to track microbubbles.\nThe 3D-CNN approach was validated in silico using a subset of the data and in\nvivo on a rat brain acquisition. In silico, the CNN reconstructed vascular\nnetworks with higher precision (81%) than a conventional ULM framework (70%).\nIn vivo, the CNN could resolve micro vessels as small as 10 $\\mu$m with an\nincrease in resolution when compared against a conventional approach.\n","authors":["Léo Milecki","Jonathan Porée","Hatim Belgharbi","Chloé Bourquin","Rafat Damseh","Patrick Delafontaine-Martel","Frédéric Lesage","Maxime Gasse","Jean Provost"],"pdf_url":"https://arxiv.org/pdf/2310.08143v1.pdf","comment":"Copyright 2021 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2310.08142v1","updated":"2023-10-12T08:57:33Z","published":"2023-10-12T08:57:33Z","title":"Fine-Grained Annotation for Face Anti-Spoofing","summary":" Face anti-spoofing plays a critical role in safeguarding facial recognition\nsystems against presentation attacks. While existing deep learning methods show\npromising results, they still suffer from the lack of fine-grained annotations,\nwhich lead models to learn task-irrelevant or unfaithful features. In this\npaper, we propose a fine-grained annotation method for face anti-spoofing.\nSpecifically, we first leverage the Segment Anything Model (SAM) to obtain\npixel-wise segmentation masks by utilizing face landmarks as point prompts. The\nface landmarks provide segmentation semantics, which segments the face into\nregions. We then adopt these regions as masks and assemble them into three\nseparate annotation maps: spoof, living, and background maps. Finally, we\ncombine three separate maps into a three-channel map as annotations for model\ntraining. 
Furthermore, we introduce the Multi-Channel Region Exchange\nAugmentation (MCREA) to diversify training data and reduce overfitting.\nExperimental results demonstrate that our method outperforms existing\nstate-of-the-art approaches in both intra-dataset and cross-dataset\nevaluations.\n","authors":["Xu Chen","Yunde Jia","Yuwei Wu"],"pdf_url":"https://arxiv.org/pdf/2310.08142v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.08139v1","updated":"2023-10-12T08:55:10Z","published":"2023-10-12T08:55:10Z","title":"DualAug: Exploiting Additional Heavy Augmentation with OOD Data\n Rejection","summary":" Data augmentation is a dominant method for reducing model overfitting and\nimproving generalization. Most existing data augmentation methods tend to find\na compromise in augmenting the data, \textit{i.e.}, increasing the amplitude of\naugmentation carefully to avoid degrading some data too much and doing harm to\nthe model performance. We delve into the relationship between data augmentation\nand model performance, revealing that the performance drop with heavy\naugmentation comes from the presence of out-of-distribution (OOD) data.\nNonetheless, as the same data transformation has different effects for\ndifferent training samples, even for heavy augmentation, there remains part of\nin-distribution data which is beneficial to model training. Based on this\nobservation, we propose a novel data augmentation method, named\n\textbf{DualAug}, to keep the augmentation in distribution as much as possible\nat a reasonable time and computational cost. We design a data mixing strategy\nto fuse augmented data from both the basic- and the heavy-augmentation\nbranches. Extensive experiments on supervised image classification benchmarks\nshow that DualAug improves various automated data augmentation methods. Moreover,\nthe experiments on semi-supervised learning and contrastive self-supervised\nlearning demonstrate that our DualAug can also improve related methods. Code is\navailable at\n\href{https://github.com/shuguang99/DualAug}{https://github.com/shuguang99/DualAug}.\n","authors":["Zehao Wang","Yiwen Guo","Qizhang Li","Guanglei Yang","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2310.08139v1.pdf","comment":"14 pages, 6 figures"},{"id":"http://arxiv.org/abs/2301.11514v5","updated":"2023-10-12T08:49:31Z","published":"2023-01-27T03:18:09Z","title":"Deep Industrial Image Anomaly Detection: A Survey","summary":" The recent rapid development of deep learning has laid a milestone in\nindustrial Image Anomaly Detection (IAD). In this paper, we provide a\ncomprehensive review of deep learning-based image anomaly detection techniques,\nfrom the perspectives of neural network architectures, levels of supervision,\nloss functions, metrics and datasets. In addition, we extract the new setting\nfrom industrial manufacturing and review the current IAD approaches under our\nproposed new setting. Moreover, we highlight several open challenges for\nimage anomaly detection. The merits and downsides of representative network\narchitectures under varying supervision are discussed. Finally, we summarize\nthe research findings and point out future research directions. 
More resources\nare available at\nhttps://github.com/M-3LAB/awesome-industrial-anomaly-detection.\n","authors":["Jiaqi Liu","Guoyang Xie","Jinbao Wang","Shangnian Li","Chengjie Wang","Feng Zheng","Yaochu Jin"],"pdf_url":"https://arxiv.org/pdf/2301.11514v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.13359v3","updated":"2023-10-12T08:47:19Z","published":"2023-01-31T01:24:45Z","title":"IM-IAD: Industrial Image Anomaly Detection Benchmark in Manufacturing","summary":" Image anomaly detection (IAD) is an urgent issue that needs to be addressed\nin modern industrial manufacturing (IM). Recently, many advanced algorithms\nhave been released, but their performance varies greatly due to non-uniformed\nsettings. That is, researchers find it difficult to analyze because they are\ndesigned for different or specific cases in IM. To eliminate this problem, we\nfirst propose a uniform IAD setting to systematically assess the effectiveness\nof these algorithms, mainly considering three aspects of supervision level\n(unsupervised, fully supervised), learning paradigm (few-shot, continual, noisy\nlabel), and efficiency (memory usage, inference speed). Then, we skillfully\nconstruct a comprehensive image anomaly detection benchmark (IM-IAD), which\nincludes 19 algorithms on 7 major datasets with the same setting. Our extensive\nexperiments (17,017 total) provide new insights into the redesign or selection\nof the IAD algorithm under uniform conditions. Importantly, the proposed IM-IAD\npresents feasible challenges and future directions for further work. We believe\nthat this work can have a significant impact on the IAD field. To foster\nreproducibility and accessibility, the source code of IM-IAD is uploaded on the\nwebsite, https://github.com/M-3LAB/IM-IAD.\n","authors":["Guoyang Xie","Jinbao Wang","Jiaqi Liu","Jiayi Lyu","Yong Liu","Chengjie Wang","Feng Zheng","Yaochu Jin"],"pdf_url":"https://arxiv.org/pdf/2301.13359v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08129v1","updated":"2023-10-12T08:36:25Z","published":"2023-10-12T08:36:25Z","title":"Tailored Visions: Enhancing Text-to-Image Generation with Personalized\n Prompt Rewriting","summary":" We propose a novel perspective of viewing large pretrained models as search\nengines, thereby enabling the repurposing of techniques previously used to\nenhance search engine performance. As an illustration, we employ a personalized\nquery rewriting technique in the realm of text-to-image generation. Despite\nsignificant progress in the field, it is still challenging to create\npersonalized visual representations that align closely with the desires and\npreferences of individual users. This process requires users to articulate\ntheir ideas in words that are both comprehensible to the models and accurately\ncapture their vision, posing difficulties for many users. In this paper, we\ntackle this challenge by leveraging historical user interactions with the\nsystem to enhance user prompts. We propose a novel approach that involves\nrewriting user prompts based a new large-scale text-to-image dataset with over\n300k prompts from 3115 users. Our rewriting model enhances the expressiveness\nand alignment of user prompts with their intended visual outputs. Experimental\nresults demonstrate the superiority of our methods over baseline approaches, as\nevidenced in our new offline evaluation method and online tests. 
Our approach\nopens up exciting possibilities of applying more search engine techniques to\nbuild truly personalized large pretrained models.\n","authors":["Zijie Chen","Lichao Zhang","Fangsheng Weng","Lili Pan","Zhenzhong Lan"],"pdf_url":"https://arxiv.org/pdf/2310.08129v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.03198v5","updated":"2023-10-12T08:31:50Z","published":"2023-04-06T16:21:56Z","title":"RFAConv: Innovating Spatial Attention and Standard Convolutional\n Operation","summary":" Spatial attention has been widely used to improve the performance of\nconvolutional neural networks. However, it has certain limitations. In this\npaper, we propose a new perspective on the effectiveness of spatial attention,\nwhich is that the spatial attention mechanism essentially solves the problem of\nconvolutional kernel parameter sharing. However, the information contained in\nthe attention map generated by spatial attention is not sufficient for\nlarge-size convolutional kernels. Therefore, we propose a novel attention\nmechanism called Receptive-Field Attention (RFA). Existing spatial attention,\nsuch as Convolutional Block Attention Module (CBAM) and Coordinated Attention\n(CA) focus only on spatial features, which does not fully address the problem\nof convolutional kernel parameter sharing. In contrast, RFA not only focuses on\nthe receptive-field spatial feature but also provides effective attention\nweights for large-size convolutional kernels. The Receptive-Field Attention\nconvolutional operation (RFAConv), developed by RFA, represents a new approach\nto replace the standard convolution operation. It offers nearly negligible\nincrement of computational cost and parameters, while significantly improving\nnetwork performance. We conducted a series of experiments on ImageNet-1k, COCO,\nand VOC datasets to demonstrate the superiority of our approach. Of particular\nimportance, we believe that it is time to shift focus from spatial features to\nreceptive-field spatial features for current spatial attention mechanisms. In\nthis way, we can further improve network performance and achieve even better\nresults. The code and pre-trained models for the relevant tasks can be found at\nhttps://github.com/Liuchen1997/RFAConv.\n","authors":["Xin Zhang","Chen Liu","Degang Yang","Tingting Song","Yichen Ye","Ke Li","Yingze Song"],"pdf_url":"https://arxiv.org/pdf/2304.03198v5.pdf","comment":"12 pages, 11figures"},{"id":"http://arxiv.org/abs/2310.08117v1","updated":"2023-10-12T08:21:17Z","published":"2023-10-12T08:21:17Z","title":"DUSA: Decoupled Unsupervised Sim2Real Adaptation for\n Vehicle-to-Everything Collaborative Perception","summary":" Vehicle-to-Everything (V2X) collaborative perception is crucial for\nautonomous driving. However, achieving high-precision V2X perception requires a\nsignificant amount of annotated real-world data, which can always be expensive\nand hard to acquire. Simulated data have raised much attention since they can\nbe massively produced at an extremely low cost. Nevertheless, the significant\ndomain gap between simulated and real-world data, including differences in\nsensor type, reflectance patterns, and road surroundings, often leads to poor\nperformance of models trained on simulated data when evaluated on real-world\ndata. In addition, there remains a domain gap between real-world collaborative\nagents, e.g. 
different types of sensors may be installed on autonomous vehicles\nand roadside infrastructures with different extrinsics, further increasing the\ndifficulty of sim2real generalization. To take full advantage of simulated\ndata, we present a new unsupervised sim2real domain adaptation method for V2X\ncollaborative detection named Decoupled Unsupervised Sim2Real Adaptation\n(DUSA). Our new method decouples the V2X collaborative sim2real domain\nadaptation problem into two sub-problems: sim2real adaptation and inter-agent\nadaptation. For sim2real adaptation, we design a Location-adaptive Sim2Real\nAdapter (LSA) module to adaptively aggregate features from critical locations\nof the feature map and align the features between simulated data and real-world\ndata via a sim/real discriminator on the aggregated global feature. For\ninter-agent adaptation, we further devise a Confidence-aware Inter-agent\nAdapter (CIA) module to align the fine-grained features from heterogeneous\nagents under the guidance of agent-wise confidence maps. Experiments\ndemonstrate the effectiveness of the proposed DUSA approach on unsupervised\nsim2real adaptation from the simulated V2XSet dataset to the real-world\nDAIR-V2X-C dataset.\n","authors":["Xianghao Kong","Wentao Jiang","Jinrang Jia","Yifeng Shi","Runsheng Xu","Si Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08117v1.pdf","comment":"ACM MM 2023"},{"id":"http://arxiv.org/abs/2310.08116v1","updated":"2023-10-12T08:17:57Z","published":"2023-10-12T08:17:57Z","title":"Multimodal Active Measurement for Human Mesh Recovery in Close Proximity","summary":" For safe and sophisticated physical human-robot interactions (pHRI), a robot\nneeds to estimate the accurate body pose or mesh of the target person. However,\nin these pHRI scenarios, the robot cannot fully observe the target person's\nbody with equipped cameras because the target person is usually close to the\nrobot. This leads to severe truncation and occlusions, and results in poor\naccuracy of human pose estimation. For better accuracy of human pose estimation\nor mesh recovery on this limited information from cameras, we propose an active\nmeasurement and sensor fusion framework of the equipped cameras and other\nsensors such as touch sensors and 2D LiDAR. These touch and LiDAR sensing are\nobtained attendantly through pHRI without additional costs. These sensor\nmeasurements are sparse but reliable and informative cues for human mesh\nrecovery. In our active measurement process, camera viewpoints and sensor\nplacements are optimized based on the uncertainty of the estimated pose, which\nis closely related to the truncated or occluded areas. In our sensor fusion\nprocess, we fuse the sensor measurements to the camera-based estimated pose by\nminimizing the distance between the estimated mesh and measured positions. Our\nmethod is agnostic to robot configurations. Experiments were conducted using\nthe Toyota Human Support Robot, which has a camera, 2D LiDAR, and a touch\nsensor on the robot arm. Our proposed method demonstrated the superiority in\nthe human pose estimation accuracy on the quantitative comparison. 
Furthermore,\nour proposed method reliably estimated the pose of the target person in\npractical settings such as target people occluded by a blanket and standing aid\nwith the robot arm.\n","authors":["Takahiro Maeda","Keisuke Takeshita","Kazuhito Tanaka"],"pdf_url":"https://arxiv.org/pdf/2310.08116v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.14108v3","updated":"2023-10-12T08:05:18Z","published":"2022-11-25T13:50:00Z","title":"3DDesigner: Towards Photorealistic 3D Object Generation and Editing with\n Text-guided Diffusion Models","summary":" Text-guided diffusion models have shown superior performance in image/video\ngeneration and editing. While few explorations have been performed in 3D\nscenarios. In this paper, we discuss three fundamental and interesting problems\non this topic. First, we equip text-guided diffusion models to achieve\n3D-consistent generation. Specifically, we integrate a NeRF-like neural field\nto generate low-resolution coarse results for a given camera view. Such results\ncan provide 3D priors as condition information for the following diffusion\nprocess. During denoising diffusion, we further enhance the 3D consistency by\nmodeling cross-view correspondences with a novel two-stream (corresponding to\ntwo different views) asynchronous diffusion process. Second, we study 3D local\nediting and propose a two-step solution that can generate 360-degree\nmanipulated results by editing an object from a single view. Step 1, we propose\nto perform 2D local editing by blending the predicted noises. Step 2, we\nconduct a noise-to-text inversion process that maps 2D blended noises into the\nview-independent text embedding space. Once the corresponding text embedding is\nobtained, 360-degree images can be generated. Last but not least, we extend our\nmodel to perform one-shot novel view synthesis by fine-tuning on a single\nimage, firstly showing the potential of leveraging text guidance for novel view\nsynthesis. Extensive experiments and various applications show the prowess of\nour 3DDesigner. The project page is available at\nhttps://3ddesigner-diffusion.github.io/.\n","authors":["Gang Li","Heliang Zheng","Chaoyue Wang","Chang Li","Changwen Zheng","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2211.14108v3.pdf","comment":"Submitted to IJCV"},{"id":"http://arxiv.org/abs/2310.00847v2","updated":"2023-10-12T08:04:14Z","published":"2023-10-02T02:01:00Z","title":"Can Pre-trained Networks Detect Familiar Out-of-Distribution Data?","summary":" Out-of-distribution (OOD) detection is critical for safety-sensitive machine\nlearning applications and has been extensively studied, yielding a plethora of\nmethods developed in the literature. However, most studies for OOD detection\ndid not use pre-trained models and trained a backbone from scratch. In recent\nyears, transferring knowledge from large pre-trained models to downstream tasks\nby lightweight tuning has become mainstream for training in-distribution (ID)\nclassifiers. To bridge the gap between the practice of OOD detection and\ncurrent classifiers, the unique and crucial problem is that the samples whose\ninformation networks know often come as OOD input. We consider that such data\nmay significantly affect the performance of large pre-trained networks because\nthe discriminability of these OOD data depends on the pre-training algorithm.\nHere, we define such OOD data as PT-OOD (Pre-Trained OOD) data. 
In this paper,\nwe aim to reveal the effect of PT-OOD on the OOD detection performance of\npre-trained networks from the perspective of pre-training algorithms. To\nachieve this, we explore the PT-OOD detection performance of supervised and\nself-supervised pre-training algorithms with linear-probing tuning, the most\ncommon efficient tuning method. Through our experiments and analysis, we find\nthat the low linear separability of PT-OOD in the feature space heavily\ndegrades the PT-OOD detection performance, and self-supervised models are more\nvulnerable to PT-OOD than supervised pre-trained models, even with\nstate-of-the-art detection methods. To solve this vulnerability, we further\npropose a unique solution to large-scale pre-trained models: Leveraging\npowerful instance-by-instance discriminative representations of pre-trained\nmodels and detecting OOD in the feature space independent of the ID decision\nboundaries. The code will be available via https://github.com/AtsuMiyai/PT-OOD.\n","authors":["Atsuyuki Miyai","Qing Yu","Go Irie","Kiyoharu Aizawa"],"pdf_url":"https://arxiv.org/pdf/2310.00847v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07522v2","updated":"2023-10-12T08:02:18Z","published":"2023-10-11T14:19:05Z","title":"S4C: Self-Supervised Semantic Scene Completion with Neural Fields","summary":" 3D semantic scene understanding is a fundamental challenge in computer\nvision. It enables mobile agents to autonomously plan and navigate arbitrary\nenvironments. SSC formalizes this challenge as jointly estimating dense\ngeometry and semantic information from sparse observations of a scene. Current\nmethods for SSC are generally trained on 3D ground truth based on aggregated\nLiDAR scans. This process relies on special sensors and annotation by hand\nwhich are costly and do not scale well. To overcome this issue, our work\npresents the first self-supervised approach to SSC called S4C that does not\nrely on 3D ground truth data. Our proposed method can reconstruct a scene from\na single image and only relies on videos and pseudo segmentation ground truth\ngenerated from off-the-shelf image segmentation network during training. Unlike\nexisting methods, which use discrete voxel grids, we represent scenes as\nimplicit semantic fields. This formulation allows querying any point within the\ncamera frustum for occupancy and semantic class. Our architecture is trained\nthrough rendering-based self-supervised losses. Nonetheless, our method\nachieves performance close to fully supervised state-of-the-art methods.\nAdditionally, our method demonstrates strong generalization capabilities and\ncan synthesize accurate segmentation maps for far away viewpoints.\n","authors":["Adrian Hayler","Felix Wimbauer","Dominik Muhle","Christian Rupprecht","Daniel Cremers"],"pdf_url":"https://arxiv.org/pdf/2310.07522v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08106v1","updated":"2023-10-12T08:01:11Z","published":"2023-10-12T08:01:11Z","title":"Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing\n Label Bias in Foundation Models","summary":" Foundation models like CLIP allow zero-shot transfer on various tasks without\nadditional training data. Yet, the zero-shot performance is less competitive\nthan a fully supervised one. Thus, to enhance the performance, fine-tuning and\nensembling are also commonly adopted to better fit the downstream tasks.\nHowever, we argue that such prior work has overlooked the inherent biases in\nfoundation models. 
Due to the highly imbalanced Web-scale training set, these\nfoundation models are inevitably skewed toward frequent semantics, and thus the\nsubsequent fine-tuning or ensembling is still biased. In this study, we\nsystematically examine the biases in foundation models and demonstrate the\nefficacy of our proposed Generalized Logit Adjustment (GLA) method. Note that\nbias estimation in foundation models is challenging, as most pre-train data\ncannot be explicitly accessed like in traditional long-tailed classification\ntasks. To this end, GLA has an optimization-based bias estimation approach for\ndebiasing foundation models. As our work resolves a fundamental flaw in the\npre-training, the proposed GLA demonstrates significant improvements across a\ndiverse range of tasks: it achieves 1.5 pp accuracy gains on ImageNet, an large\naverage improvement (1.4-4.6 pp) on 11 few-shot datasets, 2.4 pp gains on\nlong-tailed classification. Codes are in \\url{https://github.com/BeierZhu/GLA}.\n","authors":["Beier Zhu","Kaihua Tang","Qianru Sun","Hanwang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08106v1.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2212.09408v3","updated":"2023-10-12T07:55:38Z","published":"2022-12-19T12:40:13Z","title":"Universal Object Detection with Large Vision Model","summary":" Over the past few years, there has been growing interest in developing a\nbroad, universal, and general-purpose computer vision system. Such systems have\nthe potential to address a wide range of vision tasks simultaneously, without\nbeing limited to specific problems or data domains. This universality is\ncrucial for practical, real-world computer vision applications. In this study,\nour focus is on a specific challenge: the large-scale, multi-domain universal\nobject detection problem, which contributes to the broader goal of achieving a\nuniversal vision system. This problem presents several intricate challenges,\nincluding cross-dataset category label duplication, label conflicts, and the\nnecessity to handle hierarchical taxonomies. To address these challenges, we\nintroduce our approach to label handling, hierarchy-aware loss design, and\nresource-efficient model training utilizing a pre-trained large vision model.\nOur method has demonstrated remarkable performance, securing a prestigious\nsecond-place ranking in the object detection track of the Robust Vision\nChallenge 2022 (RVC 2022) on a million-scale cross-dataset object detection\nbenchmark. We believe that our comprehensive study will serve as a valuable\nreference and offer an alternative approach for addressing similar challenges\nwithin the computer vision community. The source code for our work is openly\navailable at https://github.com/linfeng93/Large-UniDet.\n","authors":["Feng Lin","Wenze Hu","Yaowei Wang","Yonghong Tian","Guangming Lu","Fanglin Chen","Yong Xu","Xiaoyu Wang"],"pdf_url":"https://arxiv.org/pdf/2212.09408v3.pdf","comment":"Accepted by International Journal of Computer Vision (IJCV). The 2nd\n place in the object detection track of the Robust Vision Challenge (RVC 2022)"},{"id":"http://arxiv.org/abs/2309.15505v2","updated":"2023-10-12T07:55:05Z","published":"2023-09-27T09:13:40Z","title":"Finite Scalar Quantization: VQ-VAE Made Simple","summary":" We propose to replace vector quantization (VQ) in the latent representation\nof VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where\nwe project the VAE representation down to a few dimensions (typically less than\n10). 
Each dimension is quantized to a small set of fixed values, leading to an\n(implicit) codebook given by the product of these sets. By appropriately\nchoosing the number of dimensions and values each dimension can take, we obtain\nthe same codebook size as in VQ. On top of such discrete representations, we\ncan train the same models that have been trained on VQ-VAE representations. For\nexample, autoregressive and masked transformer models for image generation,\nmultimodal generation, and dense prediction computer vision tasks. Concretely,\nwe employ FSQ with MaskGIT for image generation, and with UViM for depth\nestimation, colorization, and panoptic segmentation. Despite the much simpler\ndesign of FSQ, we obtain competitive performance in all these tasks. We\nemphasize that FSQ does not suffer from codebook collapse and does not need the\ncomplex machinery employed in VQ (commitment losses, codebook reseeding, code\nsplitting, entropy penalties, etc.) to learn expressive discrete\nrepresentations.\n","authors":["Fabian Mentzer","David Minnen","Eirikur Agustsson","Michael Tschannen"],"pdf_url":"https://arxiv.org/pdf/2309.15505v2.pdf","comment":"Code:\n https://github.com/google-research/google-research/tree/master/fsq"},{"id":"http://arxiv.org/abs/2310.08094v1","updated":"2023-10-12T07:40:39Z","published":"2023-10-12T07:40:39Z","title":"SingleInsert: Inserting New Concepts from a Single Image into\n Text-to-Image Models for Flexible Editing","summary":" Recent progress in text-to-image (T2I) models enables high-quality image\ngeneration with flexible textual control. To utilize the abundant visual priors\nin the off-the-shelf T2I models, a series of methods try to invert an image to\nproper embedding that aligns with the semantic space of the T2I model. However,\nthese image-to-text (I2T) inversion methods typically need multiple source\nimages containing the same concept or struggle with the imbalance between\nediting flexibility and visual fidelity. In this work, we point out that the\ncritical problem lies in the foreground-background entanglement when learning\nan intended concept, and propose a simple and effective baseline for\nsingle-image I2T inversion, named SingleInsert. SingleInsert adopts a two-stage\nscheme. In the first stage, we regulate the learned embedding to concentrate on\nthe foreground area without being associated with the irrelevant background. In\nthe second stage, we finetune the T2I model for better visual resemblance and\ndevise a semantic loss to prevent the language drift problem. With the proposed\ntechniques, SingleInsert excels in single concept generation with high visual\nfidelity while allowing flexible editing. Additionally, SingleInsert can\nperform single-image novel view synthesis and multiple concepts composition\nwithout requiring joint training. To facilitate evaluation, we design an\nediting prompt list and introduce a metric named Editing Success Rate (ESR) for\nquantitative assessment of editing flexibility. 
Our project page is:\nhttps://jarrentwu1031.github.io/SingleInsert-web/\n","authors":["Zijie Wu","Chaohui Yu","Zhen Zhu","Fan Wang","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2310.08094v1.pdf","comment":"Project page: https://jarrentwu1031.github.io/SingleInsert-web/"},{"id":"http://arxiv.org/abs/2310.08092v1","updated":"2023-10-12T07:38:28Z","published":"2023-10-12T07:38:28Z","title":"Consistent123: Improve Consistency for One Image to 3D Object Synthesis","summary":" Large image diffusion models enable novel view synthesis with high quality\nand excellent zero-shot capability. However, such models based on\nimage-to-image translation have no guarantee of view consistency, limiting the\nperformance for downstream tasks like 3D reconstruction and image-to-3D\ngeneration. To empower consistency, we propose Consistent123 to synthesize\nnovel views simultaneously by incorporating additional cross-view attention\nlayers and the shared self-attention mechanism. The proposed attention\nmechanism improves the interaction across all synthesized views, as well as the\nalignment between the condition view and novel views. In the sampling stage,\nsuch architecture supports simultaneously generating an arbitrary number of\nviews while training at a fixed length. We also introduce a progressive\nclassifier-free guidance strategy to achieve the trade-off between texture and\ngeometry for synthesized object views. Qualitative and quantitative experiments\nshow that Consistent123 outperforms baselines in view consistency by a large\nmargin. Furthermore, we demonstrate a significant improvement of Consistent123\non varying downstream tasks, showing its great potential in the 3D generation\nfield. The project page is available at consistent-123.github.io.\n","authors":["Haohan Weng","Tianyu Yang","Jianan Wang","Yu Li","Tong Zhang","C. L. Philip Chen","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08092v1.pdf","comment":"For more qualitative results, please see\n https://consistent-123.github.io/"},{"id":"http://arxiv.org/abs/2212.00564v2","updated":"2023-10-12T07:21:23Z","published":"2022-12-01T15:11:21Z","title":"Leveraging Single-View Images for Unsupervised 3D Point Cloud Completion","summary":" Point clouds captured by scanning devices are often incomplete due to\nocclusion. To overcome this limitation, point cloud completion methods have\nbeen developed to predict the complete shape of an object based on its partial\ninput. These methods can be broadly classified as supervised or unsupervised.\nHowever, both categories require a large number of 3D complete point clouds,\nwhich may be difficult to capture. In this paper, we propose Cross-PCC, an\nunsupervised point cloud completion method without requiring any 3D complete\npoint clouds. We only utilize 2D images of the complete objects, which are\neasier to capture than 3D complete and clean point clouds. Specifically, to\ntake advantage of the complementary information from 2D images, we use a\nsingle-view RGB image to extract 2D features and design a fusion module to fuse\nthe 2D and 3D features extracted from the partial point cloud. To guide the\nshape of predicted point clouds, we project the predicted points of the object\nto the 2D plane and use the foreground pixels of its silhouette maps to\nconstrain the position of the projected points. To reduce the outliers of the\npredicted point clouds, we propose a view calibrator to move the points\nprojected to the background into the foreground by the single-view silhouette\nimage. 
To the best of our knowledge, our approach is the first point cloud\ncompletion method that does not require any 3D supervision. The experimental\nresults of our method are superior to those of the state-of-the-art\nunsupervised methods by a large margin. Moreover, our method even achieves\ncomparable performance to some supervised methods. We will make the source code\npublicly available at https://github.com/ltwu6/cross-pcc.\n","authors":["Lintai Wu","Qijian Zhang","Junhui Hou","Yong Xu"],"pdf_url":"https://arxiv.org/pdf/2212.00564v2.pdf","comment":"14 pages, 10 figures"},{"id":"http://arxiv.org/abs/2310.08084v1","updated":"2023-10-12T07:17:14Z","published":"2023-10-12T07:17:14Z","title":"Volumetric Medical Image Segmentation via Scribble Annotations and Shape\n Priors","summary":" Recently, weakly-supervised image segmentation using weak annotations like\nscribbles has gained great attention in computer vision and medical image\nanalysis, since such annotations are much easier to obtain compared to\ntime-consuming and labor-intensive labeling at the pixel/voxel level. However,\ndue to a lack of structure supervision on regions of interest (ROIs), existing\nscribble-based methods suffer from poor boundary localization. Furthermore,\nmost current methods are designed for 2D image segmentation, which do not fully\nleverage the volumetric information if directly applied to each image slice. In\nthis paper, we propose a scribble-based volumetric image segmentation,\nScribble2D5, which tackles 3D anisotropic image segmentation and aims to its\nimprove boundary prediction. To achieve this, we augment a 2.5D attention UNet\nwith a proposed label propagation module to extend semantic information from\nscribbles and use a combination of static and active boundary prediction to\nlearn ROI's boundary and regularize its shape. Also, we propose an optional\nadd-on component, which incorporates the shape prior information from unpaired\nsegmentation masks to further improve model accuracy. Extensive experiments on\nthree public datasets and one private dataset demonstrate our Scribble2D5\nachieves state-of-the-art performance on volumetric image segmentation using\nscribbles and shape prior if available.\n","authors":["Qiuhui Chen","Haiying Lyu","Xinyue Hu","Yong Lu","Yi Hong"],"pdf_url":"https://arxiv.org/pdf/2310.08084v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2205.06779"},{"id":"http://arxiv.org/abs/2310.08082v1","updated":"2023-10-12T07:12:20Z","published":"2023-10-12T07:12:20Z","title":"Jointly Optimized Global-Local Visual Localization of UAVs","summary":" Navigation and localization of UAVs present a challenge when global\nnavigation satellite systems (GNSS) are disrupted and unreliable. Traditional\ntechniques, such as simultaneous localization and mapping (SLAM) and visual\nodometry (VO), exhibit certain limitations in furnishing absolute coordinates\nand mitigating error accumulation. Existing visual localization methods achieve\nautonomous visual localization without error accumulation by matching with\northo satellite images. However, doing so cannot guarantee real-time\nperformance due to the complex matching process. To address these challenges,\nwe propose a novel Global-Local Visual Localization (GLVL) network. 
Our GLVL\nnetwork is a two-stage visual localization approach, combining a large-scale\nretrieval module that finds regions similar to the UAV flight scene, and a\nfine-grained matching module that localizes the precise UAV coordinates,\nenabling real-time and precise localization. The training process is jointly\noptimized in an end-to-end manner to further enhance the model capability.\nExperiments on six UAV flight scenes encompassing both texture-rich and\ntexture-sparse regions demonstrate the ability of our model to meet the\nreal-time precise localization requirements of UAVs. Particularly, our method\nachieves a localization error of only 2.39 meters in 0.48 seconds in a village\nscene with sparse texture features.\n","authors":["Haoling Li","Jiuniu Wang","Zhiwei Wei","Wenjia Xu"],"pdf_url":"https://arxiv.org/pdf/2310.08082v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08080v1","updated":"2023-10-12T07:10:12Z","published":"2023-10-12T07:10:12Z","title":"RT-SRTS: Angle-Agnostic Real-Time Simultaneous 3D Reconstruction and\n Tumor Segmentation from Single X-Ray Projection","summary":" Radiotherapy is one of the primary treatment methods for tumors, but the\norgan movement caused by respiratory motion limits its accuracy. Recently, 3D\nimaging from single X-ray projection has received extensive attention as a\npromising way to address this issue. However, current methods can only\nreconstruct the 3D image without directly locating the tumor and are only\nvalidated for fixed-angle imaging, which fails to fully meet the requirement of\nmotion control in radiotherapy. In this study, we propose a novel imaging\nmethod RT-SRTS which integrates 3D imaging and tumor segmentation into one\nnetwork based on multi-task learning (MTL) and achieves real-time\nsimultaneous 3D reconstruction and tumor segmentation from single X-ray\nprojection at any angle. Furthermore, we propose the attention enhanced\ncalibrator (AEC) and uncertain-region elaboration (URE) modules to aid feature\nextraction and improve segmentation accuracy. We evaluated the proposed method\non ten patient cases and compared it with two state-of-the-art methods. Our\napproach not only delivered superior 3D reconstruction but also demonstrated\ncommendable tumor segmentation results. The simultaneous reconstruction and\nsegmentation could be completed in approximately 70 ms, significantly faster\nthan the required time threshold for real-time tumor tracking. The efficacy of\nboth AEC and URE was also validated through ablation studies.\n","authors":["Miao Zhu","Qiming Fu","Bo Liu","Mengxi Zhang","Bojian Li","Xiaoyan Luo","Fugen Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.08080v1.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2210.11318v2","updated":"2023-10-12T06:52:18Z","published":"2022-10-20T14:51:01Z","title":"A Survey of Computer Vision Technologies In Urban and\n Controlled-environment Agriculture","summary":" In the evolution of agriculture to its next stage, Agriculture 5.0,\nartificial intelligence will play a central role. Controlled-environment\nagriculture, or CEA, is a special form of urban and suburban agricultural\npractice that offers numerous economic, environmental, and social benefits,\nincluding shorter transportation routes to population centers, reduced\nenvironmental impact, and increased productivity. 
Due to its ability to control\nenvironmental factors, CEA couples well with computer vision (CV) in the\nadoption of real-time monitoring of the plant conditions and autonomous\ncultivation and harvesting. The objective of this paper is to familiarize CV\nresearchers with agricultural applications and agricultural practitioners with\nthe solutions offered by CV. We identify five major CV applications in CEA,\nanalyze their requirements and motivation, and survey the state of the art as\nreflected in 68 technical papers using deep learning methods. In addition, we\ndiscuss five key subareas of computer vision and how they related to these CEA\nproblems, as well as eleven vision-based CEA datasets. We hope the survey will\nhelp researchers quickly gain a bird-eye view of the striving research area and\nwill spark inspiration for new research and development.\n","authors":["Jiayun Luo","Boyang Li","Cyril Leung"],"pdf_url":"https://arxiv.org/pdf/2210.11318v2.pdf","comment":"1 overview figures, 37 pages, 8 tables, accepted by ACM Computing\n Surveys"},{"id":"http://arxiv.org/abs/2310.08073v1","updated":"2023-10-12T06:50:43Z","published":"2023-10-12T06:50:43Z","title":"Samples on Thin Ice: Re-Evaluating Adversarial Pruning of Neural\n Networks","summary":" Neural network pruning has shown to be an effective technique for reducing\nthe network size, trading desirable properties like generalization and\nrobustness to adversarial attacks for higher sparsity. Recent work has claimed\nthat adversarial pruning methods can produce sparse networks while also\npreserving robustness to adversarial examples. In this work, we first\nre-evaluate three state-of-the-art adversarial pruning methods, showing that\ntheir robustness was indeed overestimated. We then compare pruned and dense\nversions of the same models, discovering that samples on thin ice, i.e., closer\nto the unpruned model's decision boundary, are typically misclassified after\npruning. We conclude by discussing how this intuition may lead to designing\nmore effective adversarial pruning methods in future work.\n","authors":["Giorgio Piras","Maura Pintor","Ambra Demontis","Battista Biggio"],"pdf_url":"https://arxiv.org/pdf/2310.08073v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08071v1","updated":"2023-10-12T06:36:41Z","published":"2023-10-12T06:36:41Z","title":"Learning Transferable Conceptual Prototypes for Interpretable\n Unsupervised Domain Adaptation","summary":" Despite the great progress of unsupervised domain adaptation (UDA) with the\ndeep neural networks, current UDA models are opaque and cannot provide\npromising explanations, limiting their applications in the scenarios that\nrequire safe and controllable model decisions. At present, a surge of work\nfocuses on designing deep interpretable methods with adequate data annotations\nand only a few methods consider the distributional shift problem. Most existing\ninterpretable UDA methods are post-hoc ones, which cannot facilitate the model\nlearning process for performance enhancement. In this paper, we propose an\ninherently interpretable method, named Transferable Conceptual Prototype\nLearning (TCPL), which could simultaneously interpret and improve the processes\nof knowledge transfer and decision-making in UDA. To achieve this goal, we\ndesign a hierarchically prototypical module that transfers categorical basic\nconcepts from the source domain to the target domain and learns domain-shared\nprototypes for explaining the underlying reasoning process. 
With the learned\ntransferable prototypes, a self-predictive consistent pseudo-label strategy\nthat fuses confidence, predictions, and prototype information, is designed for\nselecting suitable target samples for pseudo annotations and gradually\nnarrowing down the domain gap. Comprehensive experiments show that the proposed\nmethod can not only provide effective and intuitive explanations but also\noutperform previous state-of-the-arts.\n","authors":["Junyu Gao","Xinhong Ma","Changsheng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.08071v1.pdf","comment":"Submitted to IEEE TIP"},{"id":"http://arxiv.org/abs/2310.08068v1","updated":"2023-10-12T06:32:12Z","published":"2023-10-12T06:32:12Z","title":"Frequency-Aware Re-Parameterization for Over-Fitting Based Image\n Compression","summary":" Over-fitting-based image compression requires weights compactness for\ncompression and fast convergence for practical use, posing challenges for deep\nconvolutional neural networks (CNNs) based methods. This paper presents a\nsimple re-parameterization method to train CNNs with reduced weights storage\nand accelerated convergence. The convolution kernels are re-parameterized as a\nweighted sum of discrete cosine transform (DCT) kernels enabling direct\noptimization in the frequency domain. Combined with L1 regularization, the\nproposed method surpasses vanilla convolutions by achieving a significantly\nimproved rate-distortion with low computational cost. The proposed method is\nverified with extensive experiments of over-fitting-based image restoration on\nvarious datasets, achieving up to -46.12% BD-rate on top of HEIF with only 200\niterations.\n","authors":["Yun Ye","Yanjie Pan","Qually Jiang","Ming Lu","Xiaoran Fang","Beryl Xu"],"pdf_url":"https://arxiv.org/pdf/2310.08068v1.pdf","comment":"to be published at ICIP 2023, this version fixed a mistake in Eq. (1)\n in the proceeding version"},{"id":"http://arxiv.org/abs/2310.08064v1","updated":"2023-10-12T06:26:39Z","published":"2023-10-12T06:26:39Z","title":"Age Estimation Based on Graph Convolutional Networks and Multi-head\n Attention Mechanisms","summary":" Age estimation technology is a part of facial recognition and has been\napplied to identity authentication. This technology achieves the development\nand application of a juvenile anti-addiction system by authenticating users in\nthe game. Convolutional Neural Network (CNN) and Transformer algorithms are\nwidely used in this application scenario. However, these two models cannot\nflexibly extract and model features of faces with irregular shapes, and they\nare ineffective in capturing key information. Furthermore, the above methods\nwill contain a lot of background information while extracting features, which\nwill interfere with the model. In consequence, it is easy to extract redundant\ninformation from images. In this paper, a new modeling idea is proposed to\nsolve this problem, which can flexibly model irregular objects. The Graph\nConvolutional Network (GCN) is used to extract features from irregular face\nimages effectively, and multi-head attention mechanisms are added to avoid\nredundant features and capture key region information in the image. 
This model\ncan effectively improve the accuracy of age estimation and reduce the MAE error\nvalue to about 3.64, which is better than the effect of today's age estimation\nmodel, to improve the accuracy of face recognition and identity authentication.\n","authors":["Miaomiao Yang","Changwei Yao","Shijin Yan"],"pdf_url":"https://arxiv.org/pdf/2310.08064v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07763v2","updated":"2023-10-12T06:24:02Z","published":"2023-07-15T10:06:43Z","title":"Tightly-Coupled LiDAR-Visual SLAM Based on Geometric Features for Mobile\n Agents","summary":" The mobile robot relies on SLAM (Simultaneous Localization and Mapping) to\nprovide autonomous navigation and task execution in complex and unknown\nenvironments. However, it is hard to develop a dedicated algorithm for mobile\nrobots due to dynamic and challenging situations, such as poor lighting\nconditions and motion blur. To tackle this issue, we propose a tightly-coupled\nLiDAR-visual SLAM based on geometric features, which includes two sub-systems\n(LiDAR and monocular visual SLAM) and a fusion framework. The fusion framework\nassociates the depth and semantics of the multi-modal geometric features to\ncomplement the visual line landmarks and to add direction optimization in\nBundle Adjustment (BA). This further constrains visual odometry. On the other\nhand, the entire line segment detected by the visual subsystem overcomes the\nlimitation of the LiDAR subsystem, which can only perform the local calculation\nfor geometric features. It adjusts the direction of linear feature points and\nfilters out outliers, leading to a higher accurate odometry system. Finally, we\nemploy a module to detect the subsystem's operation, providing the LiDAR\nsubsystem's output as a complementary trajectory to our system while visual\nsubsystem tracking fails. The evaluation results on the public dataset M2DGR,\ngathered from ground robots across various indoor and outdoor scenarios, show\nthat our system achieves more accurate and robust pose estimation compared to\ncurrent state-of-the-art multi-modal methods.\n","authors":["Ke Cao","Ruiping Liu","Ze Wang","Kunyu Peng","Jiaming Zhang","Junwei Zheng","Zhifeng Teng","Kailun Yang","Rainer Stiefelhagen"],"pdf_url":"https://arxiv.org/pdf/2307.07763v2.pdf","comment":"Accepted to ROBIO 2023"},{"id":"http://arxiv.org/abs/2306.06599v4","updated":"2023-10-12T05:51:30Z","published":"2023-06-11T06:27:06Z","title":"Variational Imbalanced Regression: Fair Uncertainty Quantification via\n Probabilistic Smoothing","summary":" Existing regression models tend to fall short in both accuracy and\nuncertainty estimation when the label distribution is imbalanced. In this\npaper, we propose a probabilistic deep learning model, dubbed variational\nimbalanced regression (VIR), which not only performs well in imbalanced\nregression but naturally produces reasonable uncertainty estimation as a\nbyproduct. Different from typical variational autoencoders assuming I.I.D.\nrepresentations (a data point's representation is not directly affected by\nother data points), our VIR borrows data with similar regression labels to\ncompute the latent representation's variational distribution; furthermore,\ndifferent from deterministic regression models producing point estimates, VIR\npredicts the entire normal-inverse-gamma distributions and modulates the\nassociated conjugate distributions to impose probabilistic reweighting on the\nimbalanced data, thereby providing better uncertainty estimation. 
Experiments\nin several real-world datasets show that our VIR can outperform\nstate-of-the-art imbalanced regression models in terms of both accuracy and\nuncertainty estimation. Code will soon be available at\n\\url{https://github.com/Wang-ML-Lab/variational-imbalanced-regression}.\n","authors":["Ziyan Wang","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2306.06599v4.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.08594v3","updated":"2023-10-12T05:43:45Z","published":"2023-02-16T21:38:36Z","title":"TransUPR: A Transformer-based Uncertain Point Refiner for LiDAR Point\n Cloud Semantic Segmentation","summary":" Common image-based LiDAR point cloud semantic segmentation (LiDAR PCSS)\napproaches have bottlenecks resulting from the boundary-blurring problem of\nconvolution neural networks (CNNs) and quantitation loss of spherical\nprojection. In this work, we propose a transformer-based plug-and-play\nuncertain point refiner, i.e., TransUPR, to refine selected uncertain points in\na learnable manner, which leads to an improved segmentation performance.\nUncertain points are sampled from coarse semantic segmentation results of 2D\nimage segmentation where uncertain points are located close to the object\nboundaries in the 2D range image representation and 3D spherical projection\nbackground points. Following that, the geometry and coarse semantic features of\nuncertain points are aggregated by neighbor points in 3D space without adding\nexpensive computation and memory footprint. Finally, the transformer-based\nrefiner, which contains four stacked self-attention layers, along with an MLP\nmodule, is utilized for uncertain point classification on the concatenated\nfeatures of self-attention layers. As the proposed refiner is independent of 2D\nCNNs, our TransUPR can be easily integrated into any existing image-based LiDAR\nPCSS approaches, e.g., CENet. Our TransUPR with the CENet achieves\nstate-of-the-art performance, i.e., 68.2% mean Intersection over Union (mIoU)\non the Semantic KITTI benchmark, which provides a performance improvement of\n0.6% on the mIoU compared to the original CENet.\n","authors":["Zifan Yu","Meida Chen","Zhikang Zhang","Suya You","Raghuveer Rao","Sanjeev Agarwal","Fengbo Ren"],"pdf_url":"https://arxiv.org/pdf/2302.08594v3.pdf","comment":"6 pages; Accepted by 2023 IROS"},{"id":"http://arxiv.org/abs/2310.08044v1","updated":"2023-10-12T05:34:45Z","published":"2023-10-12T05:34:45Z","title":"EC-Depth: Exploring the consistency of self-supervised monocular depth\n estimation under challenging scenes","summary":" Self-supervised monocular depth estimation holds significant importance in\nthe fields of autonomous driving and robotics. However, existing methods are\ntypically designed to train and test on clear and pristine datasets,\noverlooking the impact of various adverse conditions prevalent in real-world\nscenarios. As a result, it is commonly observed that most self-supervised\nmonocular depth estimation methods struggle to perform adequately under\nchallenging conditions. To address this issue, we present EC-Depth, a novel\nself-supervised two-stage training framework to achieve a robust depth\nestimation, starting from the foundation of depth prediction consistency under\ndifferent perturbations. Leveraging the proposed perturbation-invariant depth\nconsistency constraint module and the consistency-based pseudo-label selection\nmodule, our model attains accurate and consistent depth predictions in both\nstandard and challenging scenarios. 
Extensive experiments substantiate the\neffectiveness of the proposed method. Moreover, our method surpasses existing\nstate-of-the-art methods on KITTI, KITTI-C and DrivingStereo benchmarks,\ndemonstrating its potential for enhancing the reliability of self-supervised\nmonocular depth estimation models in real-world applications.\n","authors":["Ruijie Zhu","Ziyang Song","Chuxin Wang","Jianfeng He","Tianzhu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08044v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08042v1","updated":"2023-10-12T05:33:25Z","published":"2023-10-12T05:33:25Z","title":"X-HRNet: Towards Lightweight Human Pose Estimation with Spatially\n Unidimensional Self-Attention","summary":" High-resolution representation is necessary for human pose estimation to\nachieve high performance, and the ensuing problem is high computational\ncomplexity. In particular, predominant pose estimation methods estimate human\njoints by 2D single-peak heatmaps. Each 2D heatmap can be horizontally and\nvertically projected to and reconstructed by a pair of 1D heat vectors.\nInspired by this observation, we introduce a lightweight and powerful\nalternative, Spatially Unidimensional Self-Attention (SUSA), to the pointwise\n(1x1) convolution that is the main computational bottleneck in the depthwise\nseparable 3c3 convolution. Our SUSA reduces the computational complexity of the\npointwise (1x1) convolution by 96% without sacrificing accuracy. Furthermore,\nwe use the SUSA as the main module to build our lightweight pose estimation\nbackbone X-HRNet, where `X' represents the estimated cross-shape attention\nvectors. Extensive experiments on the COCO benchmark demonstrate the\nsuperiority of our X-HRNet, and comprehensive ablation studies show the\neffectiveness of the SUSA modules. The code is publicly available at\nhttps://github.com/cool-xuan/x-hrnet.\n","authors":["Yixuan Zhou","Xuanhan Wang","Xing Xu","Lei Zhao","Jingkuan Song"],"pdf_url":"https://arxiv.org/pdf/2310.08042v1.pdf","comment":"Accepted by ICME 2022"},{"id":"http://arxiv.org/abs/2310.05624v2","updated":"2023-10-12T05:33:19Z","published":"2023-10-09T11:26:58Z","title":"Locality-Aware Generalizable Implicit Neural Representation","summary":" Generalizable implicit neural representation (INR) enables a single\ncontinuous function, i.e., a coordinate-based neural network, to represent\nmultiple data instances by modulating its weights or intermediate features\nusing latent codes. However, the expressive power of the state-of-the-art\nmodulation is limited due to its inability to localize and capture fine-grained\ndetails of data entities such as specific pixels and rays. To address this\nissue, we propose a novel framework for generalizable INR that combines a\ntransformer encoder with a locality-aware INR decoder. The transformer encoder\npredicts a set of latent tokens from a data instance to encode local\ninformation into each latent token. The locality-aware INR decoder extracts a\nmodulation vector by selectively aggregating the latent tokens via\ncross-attention for a coordinate input and then predicts the output by\nprogressively decoding with coarse-to-fine modulation through multiple\nfrequency bandwidths. The selective token aggregation and the multi-band\nfeature modulation enable us to learn locality-aware representation in spatial\nand spectral aspects, respectively. 
Our framework significantly outperforms\nprevious generalizable INRs and validates the usefulness of the locality-aware\nlatents for downstream tasks such as image generation.\n","authors":["Doyup Lee","Chiheon Kim","Minsu Cho","Wook-Shin Han"],"pdf_url":"https://arxiv.org/pdf/2310.05624v2.pdf","comment":"19 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.08038v1","updated":"2023-10-12T05:09:27Z","published":"2023-10-12T05:09:27Z","title":"Continual Learning via Manifold Expansion Replay","summary":" In continual learning, the learner learns multiple tasks in sequence, with\ndata being acquired only once for each task. Catastrophic forgetting is a major\nchallenge to continual learning. To reduce forgetting, some existing\nrehearsal-based methods use episodic memory to replay samples of previous\ntasks. However, in the process of knowledge integration when learning a new\ntask, this strategy also suffers from catastrophic forgetting due to an\nimbalance between old and new knowledge. To address this problem, we propose a\nnovel replay strategy called Manifold Expansion Replay (MaER). We argue that\nexpanding the implicit manifold of the knowledge representation in the episodic\nmemory helps to improve the robustness and expressiveness of the model. To this\nend, we propose a greedy strategy to keep increasing the diameter of the\nimplicit manifold represented by the knowledge in the buffer during memory\nmanagement. In addition, we introduce Wasserstein distance instead of cross\nentropy as distillation loss to preserve previous knowledge. With extensive\nexperimental validation on MNIST, CIFAR10, CIFAR100, and TinyImageNet, we show\nthat the proposed method significantly improves the accuracy in continual\nlearning setup, outperforming the state of the arts.\n","authors":["Zihao Xu","Xuan Tang","Yufei Shi","Jianfeng Zhang","Jian Yang","Mingsong Chen","Xian Wei"],"pdf_url":"https://arxiv.org/pdf/2310.08038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08035v1","updated":"2023-10-12T05:03:19Z","published":"2023-10-12T05:03:19Z","title":"BaSAL: Size Balanced Warm Start Active Learning for LiDAR Semantic\n Segmentation","summary":" Active learning strives to reduce the need for costly data annotation, by\nrepeatedly querying an annotator to label the most informative samples from a\npool of unlabeled data and retraining a model from these samples. We identify\ntwo problems with existing active learning methods for LiDAR semantic\nsegmentation. First, they ignore the severe class imbalance inherent in LiDAR\nsemantic segmentation datasets. Second, to bootstrap the active learning loop,\nthey train their initial model from randomly selected data samples, which leads\nto low performance and is referred to as the cold start problem. To address\nthese problems we propose BaSAL, a size-balanced warm start active learning\nmodel, based on the observation that each object class has a characteristic\nsize. By sampling object clusters according to their size, we can thus create a\nsize-balanced dataset that is also more class-balanced. Furthermore, in\ncontrast to existing information measures like entropy or CoreSet, size-based\nsampling does not require an already trained model and thus can be used to\naddress the cold start problem. Results show that we are able to improve the\nperformance of the initial model by a large margin. 
Combining size-balanced\nsampling and warm start with established information measures, our approach\nachieves a comparable performance to training on the entire SemanticKITTI\ndataset, despite using only 5% of the annotations, which outperforms existing\nactive learning methods. We also match the existing state-of-the-art in active\nlearning on nuScenes. Our code will be made available upon paper acceptance.\n","authors":["Jiarong Wei","Yancong Lin","Holger Caesar"],"pdf_url":"https://arxiv.org/pdf/2310.08035v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08027v1","updated":"2023-10-12T04:14:28Z","published":"2023-10-12T04:14:28Z","title":"Exploring Large Language Models for Multi-Modal Out-of-Distribution\n Detection","summary":" Out-of-distribution (OOD) detection is essential for reliable and trustworthy\nmachine learning. Recent multi-modal OOD detection leverages textual\ninformation from in-distribution (ID) class names for visual OOD detection, yet\nit currently neglects the rich contextual information of ID classes. Large\nlanguage models (LLMs) encode a wealth of world knowledge and can be prompted\nto generate descriptive features for each class. Indiscriminately using such\nknowledge causes catastrophic damage to OOD detection due to LLMs'\nhallucinations, as is observed by our analysis. In this paper, we propose to\napply world knowledge to enhance OOD detection performance through selective\ngeneration from LLMs. Specifically, we introduce a consistency-based\nuncertainty calibration method to estimate the confidence score of each\ngeneration. We further extract visual objects from each image to fully\ncapitalize on the aforementioned world knowledge. Extensive experiments\ndemonstrate that our method consistently outperforms the state-of-the-art.\n","authors":["Yi Dai","Hao Lang","Kaisheng Zeng","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2310.08027v1.pdf","comment":"EMNLP2023 Findings Long Paper"},{"id":"http://arxiv.org/abs/2310.08026v1","updated":"2023-10-12T04:12:43Z","published":"2023-10-12T04:12:43Z","title":"Beyond Sharing Weights in Decoupling Feature Learning Network for UAV\n RGB-Infrared Vehicle Re-Identification","summary":" Owing to the capacity of performing full-time target search, cross-modality\nvehicle re-identification (Re-ID) based on unmanned aerial vehicle (UAV) is\ngaining more attention in both video surveillance and public security. However,\nthis promising and innovative research has not been studied sufficiently due to\nthe data inadequacy issue. Meanwhile, the cross-modality discrepancy and\norientation discrepancy challenges further aggravate the difficulty of this\ntask. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named\nUAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with\n16015 RGB and 13913 infrared images. Moreover, to meet cross-modality\ndiscrepancy and orientation discrepancy challenges, we present a hybrid weights\ndecoupling network (HWDNet) to learn the shared discriminative\norientation-invariant features. For the first challenge, we proposed a hybrid\nweights siamese network with a well-designed weight restrainer and its\ncorresponding objective function to learn both modality-specific and modality\nshared information. In terms of the second challenge, three effective\ndecoupling structures with two pretext tasks are investigated to learn\norientation-invariant feature. Comprehensive experiments are carried out to\nvalidate the effectiveness of the proposed method. 
The dataset and codes will\nbe released at https://github.com/moonstarL/UAV-CM-VeID.\n","authors":["Xingyue Liu","Jiahao Qi","Chen Chen","Kangcheng Bin","Ping Zhong"],"pdf_url":"https://arxiv.org/pdf/2310.08026v1.pdf","comment":"13 pages, 10 figures, 64 citations, submitted to TMM"},{"id":"http://arxiv.org/abs/2210.15889v4","updated":"2023-10-12T04:05:41Z","published":"2022-10-28T04:38:10Z","title":"Towards Data-and Knowledge-Driven Artificial Intelligence: A Survey on\n Neuro-Symbolic Computing","summary":" Neural-symbolic computing (NeSy), which pursues the integration of the\nsymbolic and statistical paradigms of cognition, has been an active research\narea of Artificial Intelligence (AI) for many years. As NeSy shows promise of\nreconciling the advantages of reasoning and interpretability of symbolic\nrepresentation and robust learning in neural networks, it may serve as a\ncatalyst for the next generation of AI. In the present paper, we provide a\nsystematic overview of the recent developments and important contributions of\nNeSy research. Firstly, we introduce study history of this area, covering early\nwork and foundations. We further discuss background concepts and identify key\ndriving factors behind the development of NeSy. Afterward, we categorize recent\nlandmark approaches along several main characteristics that underline this\nresearch paradigm, including neural-symbolic integration, knowledge\nrepresentation, knowledge embedding, and functionality. Next, we briefly\ndiscuss the successful application of modern NeSy approaches in several\ndomains. Then, we benchmark several NeSy methods on three representative\napplication tasks. Finally, we identify the open problems together with\npotential future research directions. This survey is expected to help new\nresearchers enter this rapidly evolving field and accelerate the progress\ntowards data-and knowledge-driven AI.\n","authors":["Wenguan Wang","Yi Yang","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2210.15889v4.pdf","comment":"Ongoing project"},{"id":"http://arxiv.org/abs/2308.03807v2","updated":"2023-10-12T03:36:17Z","published":"2023-08-06T15:47:03Z","title":"Nest-DGIL: Nesterov-optimized Deep Geometric Incremental Learning for CS\n Image Reconstruction","summary":" Proximal gradient-based optimization is one of the most common strategies to\nsolve inverse problem of images, and it is easy to implement. However, these\ntechniques often generate heavy artifacts in image reconstruction. One of the\nmost popular refinement methods is to fine-tune the regularization parameter to\nalleviate such artifacts, but it may not always be sufficient or applicable due\nto increased computational costs. In this work, we propose a deep geometric\nincremental learning framework based on the second Nesterov proximal gradient\noptimization. The proposed end-to-end network not only has the powerful\nlearning ability for high-/low-frequency image features, but also can\ntheoretically guarantee that geometric texture details will be reconstructed\nfrom preliminary linear reconstruction. Furthermore, it can avoid the risk of\nintermediate reconstruction results falling outside the geometric decomposition\ndomains and achieve fast convergence. 
Our reconstruction framework is\ndecomposed into four modules including general linear reconstruction, cascade\ngeometric incremental restoration, Nesterov acceleration, and post-processing.\nIn the image restoration step, a cascade geometric incremental learning module\nis designed to compensate for missing texture information from different\ngeometric spectral decomposition domains. Inspired by the overlap-tile\nstrategy, we also develop a post-processing module to remove the block effect\nin patch-wise-based natural image reconstruction. All parameters in the\nproposed model are learnable, an adaptive initialization technique of physical\nparameters is also employed to make model flexibility and ensure converging\nsmoothly. We compare the reconstruction performance of the proposed method with\nexisting state-of-the-art methods to demonstrate its superiority. Our source\ncodes are available at https://github.com/fanxiaohong/Nest-DGIL.\n","authors":["Xiaohong Fan","Yin Yang","Ke Chen","Yujie Feng","Jianping Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03807v2.pdf","comment":"15 pages,our source codes are available at\n https://github.com/fanxiaohong/Nest-DGIL"},{"id":"http://arxiv.org/abs/2310.06488v2","updated":"2023-10-12T03:23:40Z","published":"2023-10-10T09:57:17Z","title":"SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural\n Network","summary":" Spiking neural networks (SNNs) have demonstrated the capability to achieve\ncomparable performance to deep neural networks (DNNs) in both visual and\nlinguistic domains while offering the advantages of improved energy efficiency\nand adherence to biological plausibility. However, the extension of such\nsingle-modality SNNs into the realm of multimodal scenarios remains an\nunexplored territory. Drawing inspiration from the concept of contrastive\nlanguage-image pre-training (CLIP), we introduce a novel framework, named\nSpikeCLIP, to address the gap between two modalities within the context of\nspike-based computing through a two-step recipe involving ``Alignment\nPre-training + Dual-Loss Fine-tuning\". Extensive experiments demonstrate that\nSNNs achieve comparable results to their DNN counterparts while significantly\nreducing energy consumption across a variety of datasets commonly used for\nmultimodal model evaluation. Furthermore, SpikeCLIP maintains robust\nperformance in image classification tasks that involve class labels not\npredefined within specific categories.\n","authors":["Tianlong Li","Wenhao Liu","Changze Lv","Jianhan Xu","Cenyuan Zhang","Muling Wu","Xiaoqing Zheng","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.06488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08009v1","updated":"2023-10-12T03:21:12Z","published":"2023-10-12T03:21:12Z","title":"Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video\n Retrieval","summary":" Unsupervised video hashing usually optimizes binary codes by learning to\nreconstruct input videos. Such reconstruction constraint spends much effort on\nframe-level temporal context changes without focusing on video-level global\nsemantics that are more useful for retrieval. Hence, we address this problem by\ndecomposing video information into reconstruction-dependent and\nsemantic-dependent information, which disentangles the semantic extraction from\nreconstruction constraint. Specifically, we first design a simple dual-stream\nstructure, including a temporal layer and a hash layer. 
Then, with the help of\nsemantic similarity knowledge obtained from self-supervision, the hash layer\nlearns to capture information for semantic retrieval, while the temporal layer\nlearns to capture the information for reconstruction. In this way, the model\nnaturally preserves the disentangled semantics into binary codes. Validated by\ncomprehensive experiments, our method consistently outperforms the\nstate-of-the-arts on three video benchmarks.\n","authors":["Pandeng Li","Hongtao Xie","Jiannan Ge","Lei Zhang","Shaobo Min","Yongdong Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08009v1.pdf","comment":"17 pages, 8 figures, ECCV 2022"},{"id":"http://arxiv.org/abs/2310.08002v1","updated":"2023-10-12T03:14:02Z","published":"2023-10-12T03:14:02Z","title":"MLP-AMDC: An MLP Architecture for Adaptive-Mask-based Dual-Camera\n snapshot hyperspectral imaging","summary":" The Coded Aperture Snapshot Spectral Imaging (CASSI) system has great advantages\nover traditional methods in dynamically acquiring Hyper-Spectral Image (HSI),\nbut there are the following problems. 1) Traditional mask relies on random\npatterns or analytical design, both of which limit the performance improvement\nof CASSI. 2) Existing high-quality reconstruction algorithms are slow in\nreconstruction and can only reconstruct scene information offline. To address\nthe above two problems, this paper designs the AMDC-CASSI system, introducing an\nRGB camera with CASSI based on Adaptive-Mask as multimodal input to improve the\nreconstruction quality. The existing SOTA reconstruction schemes are based on\ntransformer, but the operation of self-attention pulls down the operation\nefficiency of the network. In order to improve the inference speed of the\nreconstruction network, this paper proposes an MLP Architecture for\nAdaptive-Mask-based Dual-Camera (MLP-AMDC) to replace the transformer structure\nof the network. Numerous experiments have shown that MLP performs no less well\nthan transformer-based structures for HSI reconstruction, while MLP greatly\nimproves the network inference speed and has fewer parameters and\noperations. Our method achieves an 8 dB improvement over SOTA and at least a 5-fold\nimprovement in reconstruction speed. (https://github.com/caizeyu1992/MLP-AMDC.)\n","authors":["Zeyu Cai","Can Zhang","Xunhao Chen","Shanghuan Liu","Chengqian Jin","Feipeng Da"],"pdf_url":"https://arxiv.org/pdf/2310.08002v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2308.01541"},{"id":"http://arxiv.org/abs/2209.05167v3","updated":"2023-10-12T03:09:25Z","published":"2022-09-12T11:51:20Z","title":"LF-VISLAM: A SLAM Framework for Large Field-of-View Cameras with\n Negative Imaging Plane on Mobile Agents","summary":" Simultaneous Localization And Mapping (SLAM) has become a crucial aspect in\nthe fields of autonomous driving and robotics. One crucial component of visual\nSLAM is the Field-of-View (FoV) of the camera, as a larger FoV allows for a\nwider range of surrounding elements and features to be perceived. However, when\nthe FoV of the camera reaches the negative half-plane, traditional methods for\nrepresenting image feature points using [u,v,1]^T become ineffective. While the\npanoramic FoV is advantageous for loop closure, its benefits are not easily\nrealized under large-attitude-angle differences where loop-closure frames\ncannot be easily matched by existing methods. 
As loop closure on wide-FoV\npanoramic data further comes with a large number of outliers, traditional\noutlier rejection methods are not directly applicable. To address these issues,\nwe propose LF-VISLAM, a Visual Inertial SLAM framework for cameras with\nextremely Large FoV with loop closure. A three-dimensional vector with unit\nlength is introduced to effectively represent feature points even on the\nnegative half-plane. The attitude information of the SLAM system is leveraged\nto guide the feature point detection of the loop closure. Additionally, a new\noutlier rejection method based on the unit length representation is integrated\ninto the loop closure module. We collect the PALVIO dataset using a Panoramic\nAnnular Lens (PAL) system with an entire FoV of 360{\\deg}x(40{\\deg}~120{\\deg})\nand an Inertial Measurement Unit (IMU) for Visual Inertial Odometry (VIO) to\naddress the lack of panoramic SLAM datasets. Experiments on the established\nPALVIO and public datasets show that the proposed LF-VISLAM outperforms\nstate-of-the-art SLAM methods. Our code will be open-sourced at\nhttps://github.com/flysoaryun/LF-VISLAM.\n","authors":["Ze Wang","Kailun Yang","Hao Shi","Peng Li","Fei Gao","Jian Bai","Kaiwei Wang"],"pdf_url":"https://arxiv.org/pdf/2209.05167v3.pdf","comment":"Accepted to IEEE Transactions on Automation Science and Engineering\n (T-ASE). Extended version of IROS2022 paper arXiv:2202.12613. Code and\n dataset will be open-sourced at https://github.com/flysoaryun/LF-SLAM"},{"id":"http://arxiv.org/abs/2310.07997v1","updated":"2023-10-12T02:52:33Z","published":"2023-10-12T02:52:33Z","title":"Point-NeuS: Point-Guided Neural Implicit Surface Reconstruction by\n Volume Rendering","summary":" Recently, learning neural implicit surface by volume rendering has been a\npromising way for multi-view reconstruction. However, limited accuracy and\nexcessive time complexity remain bottlenecks that current methods urgently need\nto overcome. To address these challenges, we propose a new method called\nPoint-NeuS, utilizing point-guided mechanisms to achieve accurate and efficient\nreconstruction. Point modeling is organically embedded into the volume\nrendering to enhance and regularize the representation of implicit surface.\nSpecifically, to achieve precise point guidance and noise robustness, aleatoric\nuncertainty of the point cloud is modeled to capture the distribution of noise\nand estimate the reliability of points. Additionally, a Neural Projection\nmodule connecting points and images is introduced to add geometric constraints\nto the Signed Distance Function (SDF). To better compensate for geometric bias\nbetween volume rendering and point modeling, high-fidelity points are filtered\ninto an Implicit Displacement Network to improve the representation of SDF.\nBenefiting from our effective point guidance, lightweight networks are employed\nto achieve an impressive 11x speedup compared to NeuS. Extensive experiments\nshow that our method yields high-quality surfaces, especially for fine-grained\ndetails and smooth regions. 
Moreover, it exhibits strong robustness to both\nnoisy and sparse data.\n","authors":["Chen Zhang","Wanjuan Su","Wenbing Tao"],"pdf_url":"https://arxiv.org/pdf/2310.07997v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07996v1","updated":"2023-10-12T02:52:14Z","published":"2023-10-12T02:52:14Z","title":"Reset It and Forget It: Relearning Last-Layer Weights Improves Continual\n and Transfer Learning","summary":" This work identifies a simple pre-training mechanism that leads to\nrepresentations exhibiting better continual and transfer learning. This\nmechanism -- the repeated resetting of weights in the last layer, which we\nnickname \"zapping\" -- was originally designed for a meta-continual-learning\nprocedure, yet we show it is surprisingly applicable in many settings beyond\nboth meta-learning and continual learning. In our experiments, we wish to\ntransfer a pre-trained image classifier to a new set of classes, in a few\nshots. We show that our zapping procedure results in improved transfer accuracy\nand/or more rapid adaptation in both standard fine-tuning and continual\nlearning settings, while being simple to implement and computationally\nefficient. In many cases, we achieve performance on par with state of the art\nmeta-learning without needing the expensive higher-order gradients, by using a\ncombination of zapping and sequential learning. An intuitive explanation for\nthe effectiveness of this zapping procedure is that representations trained\nwith repeated zapping learn features that are capable of rapidly adapting to\nnewly initialized classifiers. Such an approach may be considered a\ncomputationally cheaper type of, or alternative to, meta-learning rapidly\nadaptable features with higher-order gradients. This adds to recent work on the\nusefulness of resetting neural network parameters during training, and invites\nfurther investigation of this mechanism.\n","authors":["Lapo Frati","Neil Traft","Jeff Clune","Nick Cheney"],"pdf_url":"https://arxiv.org/pdf/2310.07996v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07995v1","updated":"2023-10-12T02:49:00Z","published":"2023-10-12T02:49:00Z","title":"HeightFormer: A Multilevel Interaction and Image-adaptive\n Classification-regression Network for Monocular Height Estimation with Aerial\n Images","summary":" Height estimation has long been a pivotal topic within measurement and remote\nsensing disciplines, proving critical for endeavours such as 3D urban\nmodelling, MR and autonomous driving. Traditional methods utilise stereo\nmatching or multisensor fusion, both well-established techniques that typically\nnecessitate multiple images from varying perspectives and adjunct sensors like\nSAR, leading to substantial deployment costs. Single image height estimation\nhas emerged as an attractive alternative, boasting a larger data source variety\nand simpler deployment. However, current methods suffer from limitations such\nas fixed receptive fields, a lack of global information interaction, leading to\nnoticeable instance-level height deviations. The inherent complexity of height\nprediction can result in a blurry estimation of object edge depth when using\nmainstream regression methods based on fixed height division. This paper\npresents a comprehensive solution for monocular height estimation in remote\nsensing, termed HeightFormer, combining multilevel interactions and\nimage-adaptive classification-regression. 
It features the Multilevel\nInteraction Backbone (MIB) and Image-adaptive Classification-regression Height\nGenerator (ICG). MIB supplements the fixed sample grid in CNN of the\nconventional backbone network with tokens of different interaction ranges. It\nis complemented by a pixel-, patch-, and feature map-level hierarchical\ninteraction mechanism, designed to relay spatial geometry information across\ndifferent scales and introducing a global receptive field to enhance the\nquality of instance-level height estimation. The ICG dynamically generates\nheight partition for each image and reframes the traditional regression task,\nusing a refinement from coarse to fine classification-regression that\nsignificantly mitigates the innate ill-posedness issue and drastically improves\nedge sharpness.\n","authors":["Zhan Chen","Yidan Zhang","Xiyu Qi","Yongqiang Mao","Xin Zhou","Lulu Niu","Hui Wu","Lei Wang","Yunping Ge"],"pdf_url":"https://arxiv.org/pdf/2310.07995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12535v2","updated":"2023-10-12T02:38:50Z","published":"2023-03-21T17:28:44Z","title":"An Effective Motion-Centric Paradigm for 3D Single Object Tracking in\n Point Clouds","summary":" 3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial\nrole in autonomous driving. Current approaches all follow the Siamese paradigm\nbased on appearance matching. However, LiDAR point clouds are usually\ntextureless and incomplete, which hinders effective appearance matching.\nBesides, previous methods greatly overlook the critical motion clues among\ntargets. In this work, beyond 3D Siamese tracking, we introduce a\nmotion-centric paradigm to handle LiDAR SOT from a new perspective. Following\nthis paradigm, we propose a matching-free two-stage tracker M^2-Track. At the\n1st-stage, M^2-Track localizes the target within successive frames via motion\ntransformation. Then it refines the target box through motion-assisted shape\ncompletion at the 2nd-stage. Due to the motion-centric nature, our method shows\nits impressive generalizability with limited training labels and provides good\ndifferentiability for end-to-end cycle training. This inspires us to explore\nsemi-supervised LiDAR SOT by incorporating a pseudo-label-based motion\naugmentation and a self-supervised loss term. Under the fully-supervised\nsetting, extensive experiments confirm that M^2-Track significantly outperforms\nprevious state-of-the-arts on three large-scale datasets while running at 57FPS\n(~3%, ~11% and ~22% precision gains on KITTI, NuScenes, and Waymo Open Dataset\nrespectively). While under the semi-supervised setting, our method performs on\npar with or even surpasses its fully-supervised counterpart using fewer than\nhalf of the labels from KITTI. Further analysis verifies each component's\neffectiveness and shows the motion-centric paradigm's promising potential for\nauto-labeling and unsupervised domain adaptation.\n","authors":["Chaoda Zheng","Xu Yan","Haiming Zhang","Baoyuan Wang","Shenghui Cheng","Shuguang Cui","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2303.12535v2.pdf","comment":"Accepted version of the journal extension of M^2-Track. Accepted by\n TPAMI. 
arXiv admin note: substantial text overlap with arXiv:2203.01730"},{"id":"http://arxiv.org/abs/2207.12389v2","updated":"2023-10-12T02:01:50Z","published":"2022-07-25T17:55:28Z","title":"MemSAC: Memory Augmented Sample Consistency for Large Scale Unsupervised\n Domain Adaptation","summary":" Practical real world datasets with plentiful categories introduce new\nchallenges for unsupervised domain adaptation like small inter-class\ndiscriminability, that existing approaches relying on domain invariance alone\ncannot handle sufficiently well. In this work we propose MemSAC, which exploits\nsample level similarity across source and target domains to achieve\ndiscriminative transfer, along with architectures that scale to a large number\nof categories. For this purpose, we first introduce a memory augmented approach\nto efficiently extract pairwise similarity relations between labeled source and\nunlabeled target domain instances, suited to handle an arbitrary number of\nclasses. Next, we propose and theoretically justify a novel variant of the\ncontrastive loss to promote local consistency among within-class cross domain\nsamples while enforcing separation between classes, thus preserving\ndiscriminative transfer from source to target. We validate the advantages of\nMemSAC with significant improvements over previous state-of-the-art on multiple\nchallenging transfer tasks designed for large-scale adaptation, such as\nDomainNet with 345 classes and fine-grained adaptation on Caltech-UCSD birds\ndataset with 200 classes. We also provide in-depth analysis and insights into\nthe effectiveness of MemSAC.\n","authors":["Tarun Kalluri","Astuti Sharma","Manmohan Chandraker"],"pdf_url":"https://arxiv.org/pdf/2207.12389v2.pdf","comment":"Accepted at ECCV 2022. Project Webpage:\n https://tarun005.github.io/MemSAC/"},{"id":"http://arxiv.org/abs/2310.07975v1","updated":"2023-10-12T01:47:55Z","published":"2023-10-12T01:47:55Z","title":"Self-supervised visual learning for analyzing firearms trafficking\n activities on the Web","summary":" Automated visual firearms classification from RGB images is an important\nreal-world task with applications in public space security, intelligence\ngathering and law enforcement investigations. When applied to images massively\ncrawled from the World Wide Web (including social media and dark Web sites), it\ncan serve as an important component of systems that attempt to identify\ncriminal firearms trafficking networks, by analyzing Big Data from open-source\nintelligence. Deep Neural Networks (DNN) are the state-of-the-art methodology\nfor achieving this, with Convolutional Neural Networks (CNN) being typically\nemployed. The common transfer learning approach consists of pretraining on a\nlarge-scale, generic annotated dataset for whole-image classification, such as\nImageNet-1k, and then finetuning the DNN on a smaller, annotated,\ntask-specific, downstream dataset for visual firearms classification. Neither\nVisual Transformer (ViT) neural architectures nor Self-Supervised Learning\n(SSL) approaches have been so far evaluated on this critical task. SSL\nessentially consists of replacing the traditional supervised pretraining\nobjective with an unsupervised pretext task that does not require ground-truth\nlabels..\n","authors":["Sotirios Konstantakos","Despina Ioanna Chalkiadaki","Ioannis Mademlis","Adamantia Anna Rebolledo Chrysochoou","Georgios Th. 
Papadopoulos"],"pdf_url":"https://arxiv.org/pdf/2310.07975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.01244v2","updated":"2023-10-12T01:44:33Z","published":"2022-10-03T21:50:14Z","title":"Event-based Temporally Dense Optical Flow Estimation with Sequential\n Learning","summary":" Event cameras provide an advantage over traditional frame-based cameras when\ncapturing fast-moving objects without a motion blur. They achieve this by\nrecording changes in light intensity (known as events), thus allowing them to\noperate at a much higher frequency and making them suitable for capturing\nmotions in a highly dynamic scene. Many recent studies have proposed methods to\ntrain neural networks (NNs) for predicting optical flow from events. However,\nthey often rely on a spatio-temporal representation constructed from events\nover a fixed interval, such as 10Hz used in training on the DSEC dataset. This\nlimitation restricts the flow prediction to the same interval (10Hz) whereas\nthe fast speed of event cameras, which can operate up to 3kHz, has not been\neffectively utilized. In this work, we show that a temporally dense flow\nestimation at 100Hz can be achieved by treating the flow estimation as a\nsequential problem using two different variants of recurrent networks -\nLong-short term memory (LSTM) and spiking neural network (SNN). First, We\nutilize the NN model constructed similar to the popular EV-FlowNet but with\nLSTM layers to demonstrate the efficiency of our training method. The model not\nonly produces 10x more frequent optical flow than the existing ones, but the\nestimated flows also have 13% lower errors than predictions from the baseline\nEV-FlowNet. Second, we construct an EV-FlowNet SNN but with leaky integrate and\nfire neurons to efficiently capture the temporal dynamics. We found that simple\ninherent recurrent dynamics of SNN lead to significant parameter reduction\ncompared to the LSTM model. In addition, because of its event-driven\ncomputation, the spiking model is estimated to consume only 1.5% energy of the\nLSTM model, highlighting the efficiency of SNN in processing events and the\npotential for achieving temporally dense flow.\n","authors":["Wachirawit Ponghiran","Chamika Mihiranga Liyanagedera","Kaushik Roy"],"pdf_url":"https://arxiv.org/pdf/2210.01244v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07969v1","updated":"2023-10-12T01:25:21Z","published":"2023-10-12T01:25:21Z","title":"CleftGAN: Adapting A Style-Based Generative Adversarial Network To\n Create Images Depicting Cleft Lip Deformity","summary":" A major obstacle when attempting to train a machine learning system to\nevaluate facial clefts is the scarcity of large datasets of high-quality,\nethics board-approved patient images. In response, we have built a deep\nlearning-based cleft lip generator designed to produce an almost unlimited\nnumber of artificial images exhibiting high-fidelity facsimiles of cleft lip\nwith wide variation. We undertook a transfer learning protocol testing\ndifferent versions of StyleGAN-ADA (a generative adversarial network image\ngenerator incorporating adaptive data augmentation (ADA)) as the base model.\nTraining images depicting a variety of cleft deformities were pre-processed to\nadjust for rotation, scaling, color adjustment and background blurring. The ADA\nmodification of the primary algorithm permitted construction of our new\ngenerative model while requiring input of a relatively small number of training\nimages. 
Adversarial training was carried out using 514 unique frontal\nphotographs of cleft-affected faces to adapt a pre-trained model based on\n70,000 normal faces. The Frechet Inception Distance (FID) was used to measure\nthe similarity of the newly generated facial images to the cleft training\ndataset, while Perceptual Path Length (PPL) and the novel Divergence Index of\nSeverity Histograms (DISH) measures were also used to assess the performance of\nthe image generator that we dub CleftGAN. We found that StyleGAN3 with\ntranslation invariance (StyleGAN3-t) performed optimally as a base model.\nGenerated images achieved a low FID reflecting a close similarity to our\ntraining input dataset of genuine cleft images. Low PPL and DISH measures\nreflected a smooth and semantically valid interpolation of images through the\ntransfer learning process and a similar distribution of severity in the\ntraining and generated images, respectively.\n","authors":["Abdullah Hayajneh","Erchin Serpedin","Mohammad Shaqfeh","Graeme Glass","Mitchell A. Stotland"],"pdf_url":"https://arxiv.org/pdf/2310.07969v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03270v3","updated":"2023-10-12T01:13:41Z","published":"2023-10-05T02:51:53Z","title":"EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit\n Diffusion Models","summary":" Diffusion models have demonstrated remarkable capabilities in image synthesis\nand related generative tasks. Nevertheless, their practicality for low-latency\nreal-world applications is constrained by substantial computational costs and\nlatency issues. Quantization is a dominant way to compress and accelerate\ndiffusion models, where post-training quantization (PTQ) and quantization-aware\ntraining (QAT) are two main approaches, each bearing its own properties. While\nPTQ exhibits efficiency in terms of both time and data usage, it may lead to\ndiminished performance in low bit-width. On the other hand, QAT can alleviate\nperformance degradation but comes with substantial demands on computational and\ndata resources. To capitalize on the advantages while avoiding their respective\ndrawbacks, we introduce a data-free and parameter-efficient fine-tuning\nframework for low-bit diffusion models, dubbed EfficientDM, to achieve\nQAT-level performance with PTQ-like efficiency. Specifically, we propose a\nquantization-aware variant of the low-rank adapter (QALoRA) that can be merged\nwith model weights and jointly quantized to low bit-width. The fine-tuning\nprocess distills the denoising capabilities of the full-precision model into\nits quantized counterpart, eliminating the requirement for training data. We\nalso introduce scale-aware optimization and employ temporal learned step-size\nquantization to further enhance performance. Extensive experimental results\ndemonstrate that our method significantly outperforms previous PTQ-based\ndiffusion models while maintaining similar time and data efficiency.\nSpecifically, there is only a marginal 0.05 sFID increase when quantizing both\nweights and activations of LDM-4 to 4-bit on ImageNet 256x256. 
Compared to\nQAT-based methods, our EfficientDM also boasts a 16.2x faster quantization\nspeed with comparable generation quality.\n","authors":["Yefei He","Jing Liu","Weijia Wu","Hong Zhou","Bohan Zhuang"],"pdf_url":"https://arxiv.org/pdf/2310.03270v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.00613v3","updated":"2023-10-12T00:27:09Z","published":"2022-12-01T16:09:54Z","title":"NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and\n Animation","summary":" The capture and animation of human hair are two of the major challenges in\nthe creation of realistic avatars for the virtual reality. Both problems are\nhighly challenging, because hair has complex geometry and appearance, as well\nas exhibits challenging motion. In this paper, we present a two-stage approach\nthat models hair independently from the head to address these challenges in a\ndata-driven manner. The first stage, state compression, learns a\nlow-dimensional latent space of 3D hair states containing motion and\nappearance, via a novel autoencoder-as-a-tracker strategy. To better\ndisentangle the hair and head in appearance learning, we employ multi-view hair\nsegmentation masks in combination with a differentiable volumetric renderer.\nThe second stage learns a novel hair dynamics model that performs temporal hair\ntransfer based on the discovered latent codes. To enforce higher stability\nwhile driving our dynamics model, we employ the 3D point-cloud autoencoder from\nthe compression stage for de-noising of the hair state. Our model outperforms\nthe state of the art in novel view synthesis and is capable of creating novel\nhair animations without having to rely on hair observations as a driving\nsignal. Project page is here https://ziyanw1.github.io/neuwigs/.\n","authors":["Ziyan Wang","Giljoo Nam","Tuur Stuyck","Stephen Lombardi","Chen Cao","Jason Saragih","Michael Zollhoefer","Jessica Hodgins","Christoph Lassner"],"pdf_url":"https://arxiv.org/pdf/2212.00613v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.08141v2","updated":"2023-10-12T23:50:05Z","published":"2023-01-11T22:47:12Z","title":"Self-supervised Learning for Segmentation and Quantification of Dopamine\n Neurons in Parkinson's Disease","summary":" Parkinson's Disease (PD) is the second most common neurodegenerative disease\nin humans. PD is characterized by the gradual loss of dopaminergic neurons in\nthe Substantia Nigra (SN). Counting the number of dopaminergic neurons in the\nSN is one of the most important indexes in evaluating drug efficacy in PD\nanimal models. Currently, analyzing and quantifying dopaminergic neurons is\nconducted manually by experts through analysis of digital pathology images\nwhich is laborious, time-consuming, and highly subjective. As such, a reliable\nand unbiased automated system is demanded for the quantification of\ndopaminergic neurons in digital pathology images. Recent years have seen a\nsurge in adopting deep learning solutions in medical image processing. However,\ndeveloping high-performing deep learning models hinges on the availability of\nlarge-scale, high-quality annotated data, which can be expensive to acquire,\nespecially in applications like digital pathology image analysis. To this end,\nwe propose an end-to-end deep learning framework based on self-supervised\nlearning for the segmentation and quantification of dopaminergic neurons in PD\nanimal models. 
To the best of our knowledge, this is the first deep learning\nmodel that detects the cell body of dopaminergic neurons, counts the number of\ndopaminergic neurons, and provides characteristics of individual dopaminergic\nneurons as a numerical output. Extensive experiments demonstrate the\neffectiveness of our model in quantifying neurons with high precision, which\ncan provide a faster turnaround for drug efficacy studies, better understanding\nof dopaminergic neuronal health status, and unbiased results in PD pre-clinical\nresearch. As part of our contributions, we also provide the first publicly\navailable dataset of histology digital images along with expert annotations for\nthe segmentation of TH-positive DA neuronal soma.\n","authors":["Fatemeh Haghighi","Soumitra Ghosh","Hai Ngu","Sarah Chu","Han Lin","Mohsen Hejrati","Baris Bingol","Somaye Hashemifar"],"pdf_url":"https://arxiv.org/pdf/2301.08141v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08772v1","updated":"2023-10-12T23:38:52Z","published":"2023-10-12T23:38:52Z","title":"Investigating the Robustness and Properties of Detection Transformers\n (DETR) Toward Difficult Images","summary":" Transformer-based object detectors (DETR) have shown significant performance\nacross machine vision tasks, ultimately in object detection. This detector is\nbased on a self-attention mechanism along with the transformer encoder-decoder\narchitecture to capture the global context in the image. The critical issue to\nbe addressed is how this model architecture can handle different image\nnuisances, such as occlusion and adversarial perturbations. We studied this\nissue by measuring the performance of DETR with different experiments and\nbenchmarking the network with convolutional neural network (CNN) based\ndetectors like YOLO and Faster-RCNN. We found that DETR performs well when it\ncomes to resistance to interference from information loss in occlusion images.\nDespite that, we found that the adversarial stickers put on the image require\nthe network to produce a new unnecessary set of keys, queries, and values,\nwhich in most cases, results in a misdirection of the network. DETR also\nperformed poorer than YOLOv5 in the image corruption benchmark. Furthermore, we\nfound that DETR depends heavily on the main query when making a prediction,\nwhich leads to imbalanced contributions between queries since the main query\nreceives most of the gradient flow.\n","authors":["Zhao Ning Zou","Yuhang Zhang","Robert Wijaya"],"pdf_url":"https://arxiv.org/pdf/2310.08772v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08756v1","updated":"2023-10-12T22:51:51Z","published":"2023-10-12T22:51:51Z","title":"Intelligent Scoliosis Screening and Diagnosis: A Survey","summary":" Scoliosis is a three-dimensional spinal deformity, which may lead to abnormal\nmorphologies, such as thoracic deformity, and pelvic tilt. Severe patients may\nsuffer from nerve damage and urinary abnormalities. At present, the number of\nscoliosis patients in primary and secondary schools has exceeded five million\nin China, the incidence rate is about 3% to 5% which is growing every year. The\nresearch on scoliosis, therefore, has important clinical value. This paper\nsystematically introduces computer-assisted scoliosis screening and diagnosis\nas well as analyzes the advantages and limitations of different algorithm\nmodels in the current issue field. 
Moreover, the paper also discusses the\ncurrent development bottlenecks in this field and looks forward to future\ndevelopment trends.\n","authors":["Zhang Zhenlin","Pu Lixin","Li Ang","Zhang Jun","Li Xianjie","Fan Jipeng"],"pdf_url":"https://arxiv.org/pdf/2310.08756v1.pdf","comment":"in Chinese language"},{"id":"http://arxiv.org/abs/2306.14941v2","updated":"2023-10-12T22:49:49Z","published":"2023-06-26T17:54:24Z","title":"SIMMF: Semantics-aware Interactive Multiagent Motion Forecasting for\n Autonomous Vehicle Driving","summary":" Autonomous vehicles require motion forecasting of their surrounding\nmultiagents (pedestrians and vehicles) to make optimal decisions for\nnavigation. The existing methods focus on techniques to utilize the positions\nand velocities of these agents and fail to capture semantic information from\nthe scene. Moreover, to mitigate the increase in computational complexity\nassociated with the number of agents in the scene, some works leverage\nEuclidean distance to prune far-away agents. However, distance-based metric\nalone is insufficient to select relevant agents and accurately perform their\npredictions. To resolve these issues, we propose the Semantics-aware\nInteractive Multiagent Motion Forecasting (SIMMF) method to capture semantics\nalong with spatial information and optimally select relevant agents for motion\nprediction. Specifically, we achieve this by implementing a semantic-aware\nselection of relevant agents from the scene and passing them through an\nattention mechanism to extract global encodings. These encodings along with\nagents' local information, are passed through an encoder to obtain\ntime-dependent latent variables for a motion policy predicting the future\ntrajectories. Our results show that the proposed approach outperforms\nstate-of-the-art baselines and provides more accurate and scene-consistent\npredictions.\n","authors":["Vidyaa Krishnan Nivash","Ahmed H. Qureshi"],"pdf_url":"https://arxiv.org/pdf/2306.14941v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08755v1","updated":"2023-10-12T22:45:03Z","published":"2023-10-12T22:45:03Z","title":"PU-Ray: Point Cloud Upsampling via Ray Marching on Implicit Surface","summary":" While the recent advancements in deep-learning-based point cloud upsampling\nmethods improve the input to autonomous driving systems, they still suffer from\nthe uncertainty of denser point generation resulting from end-to-end learning.\nFor example, due to the vague training objectives of the models, their\nperformance depends on the point distributions of the input and the ground\ntruth. This causes problems of domain dependency between synthetic and\nreal-scanned point clouds and issues with substantial model sizes and dataset\nrequirements. Additionally, many existing methods upsample point clouds with a\nfixed scaling rate, making them inflexible and computationally redundant. This\npaper addresses the above problems by proposing a ray-based upsampling approach\nwith an arbitrary rate, where a depth prediction is made for each query ray.\nThe method simulates the ray marching algorithm to achieve more precise and\nstable ray-depth predictions through implicit surface learning. The rule-based\nmid-point query sampling method enables a uniform output point distribution\nwithout requiring model training using the Chamfer distance loss function,\nwhich can exhibit bias towards the training dataset. Self-supervised learning\nbecomes possible with accurate ground truths within the input point cloud. 
The\nresults demonstrate the method's versatility across different domains and\ntraining scenarios with limited computational resources and training data. This\nallows the upsampling task to transition from academic research to real-world\napplications.\n","authors":["Sangwon Lim","Karim El-Basyouny","Yee Hong Yang"],"pdf_url":"https://arxiv.org/pdf/2310.08755v1.pdf","comment":"13 pages (10 main + 3 supplement), 19 figures (10 main + 9\n supplement), 6 tables"},{"id":"http://arxiv.org/abs/2310.07682v2","updated":"2023-10-12T22:28:05Z","published":"2023-10-11T17:32:24Z","title":"Prediction of MET Overexpression in Non-Small Cell Lung Adenocarcinomas\n from Hematoxylin and Eosin Images","summary":" MET protein overexpression is a targetable event in non-small cell lung\ncancer (NSCLC) and is the subject of active drug development. Challenges in\nidentifying patients for these therapies include lack of access to validated\ntesting, such as standardized immunohistochemistry (IHC) assessment, and\nconsumption of valuable tissue for a single gene/protein assay. Development of\npre-screening algorithms using routinely available digitized hematoxylin and\neosin (H&E)-stained slides to predict MET overexpression could promote testing\nfor those who will benefit most. While assessment of MET expression using IHC\nis currently not routinely performed in NSCLC, next-generation sequencing is\ncommon and in some cases includes RNA expression panel testing. In this work,\nwe leveraged a large database of matched H&E slides and RNA expression data to\ntrain a weakly supervised model to predict MET RNA overexpression directly from\nH&E images. This model was evaluated on an independent holdout test set of 300\nover-expressed and 289 normal patients, demonstrating an ROC-AUC of 0.70 (95th\npercentile interval: 0.66 - 0.74) with stable performance characteristics\nacross different patient clinical variables and robust to synthetic noise on\nthe test set. These results suggest that H&E-based predictive models could be\nuseful to prioritize patients for confirmatory testing of MET protein or MET\ngene expression status.\n","authors":["Kshitij Ingale","Sun Hae Hong","Josh S. K. Bell","Abbas Rizvi","Amy Welch","Lingdao Sha","Irvin Ho","Kunal Nagpal","Aicha BenTaieb","Rohan P Joshi","Martin C Stumpe"],"pdf_url":"https://arxiv.org/pdf/2310.07682v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.06366v4","updated":"2023-10-12T22:25:43Z","published":"2022-10-12T16:18:25Z","title":"A Generalist Framework for Panoptic Segmentation of Images and Videos","summary":" Panoptic segmentation assigns semantic and instance ID labels to every pixel\nof an image. As permutations of instance IDs are also valid solutions, the task\nrequires learning of high-dimensional one-to-many mapping. As a result,\nstate-of-the-art approaches use customized architectures and task-specific loss\nfunctions. We formulate panoptic segmentation as a discrete data generation\nproblem, without relying on inductive bias of the task. A diffusion model is\nproposed to model panoptic masks, with a simple architecture and generic loss\nfunction. By simply adding past predictions as a conditioning signal, our\nmethod is capable of modeling video (in a streaming setting) and thereby learns\nto track object instances automatically. 
With extensive experiments, we\ndemonstrate that our simple approach can perform competitively to\nstate-of-the-art specialist methods in similar settings.\n","authors":["Ting Chen","Lala Li","Saurabh Saxena","Geoffrey Hinton","David J. Fleet"],"pdf_url":"https://arxiv.org/pdf/2210.06366v4.pdf","comment":"ICCV'23. Code at https://github.com/google-research/pix2seq"},{"id":"http://arxiv.org/abs/2310.08745v1","updated":"2023-10-12T22:15:06Z","published":"2023-10-12T22:15:06Z","title":"AcTExplore: Active Tactile Exploration on Unknown Objects","summary":" Tactile exploration plays a crucial role in understanding object structures\nfor fundamental robotics tasks such as grasping and manipulation. However,\nefficiently exploring such objects using tactile sensors is challenging,\nprimarily due to the large-scale unknown environments and limited sensing\ncoverage of these sensors. To this end, we present AcTExplore, an active\ntactile exploration method driven by reinforcement learning for object\nreconstruction at scales that automatically explores the object surfaces in a\nlimited number of steps. Through sufficient exploration, our algorithm\nincrementally collects tactile data and reconstructs 3D shapes of the objects\nas well, which can serve as a representation for higher-level downstream tasks.\nOur method achieves an average of 95.97% IoU coverage on unseen YCB objects\nwhile just being trained on primitive shapes.\n","authors":["Amir-Hossein Shahidzadeh","Seong Jong Yoo","Pavan Mantripragada","Chahat Deep Singh","Cornelia Fermüller","Yiannis Aloimonos"],"pdf_url":"https://arxiv.org/pdf/2310.08745v1.pdf","comment":"8 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.08743v1","updated":"2023-10-12T22:09:53Z","published":"2023-10-12T22:09:53Z","title":"Development and Validation of a Deep Learning-Based Microsatellite\n Instability Predictor from Prostate Cancer Whole-Slide Images","summary":" Microsatellite instability-high (MSI-H) is a tumor agnostic biomarker for\nimmune checkpoint inhibitor therapy. However, MSI status is not routinely\ntested in prostate cancer, in part due to low prevalence and assay cost. As\nsuch, prediction of MSI status from hematoxylin and eosin (H&E) stained\nwhole-slide images (WSIs) could identify prostate cancer patients most likely\nto benefit from confirmatory testing and becoming eligible for immunotherapy.\nProstate biopsies and surgical resections from de-identified records of\nconsecutive prostate cancer patients referred to our institution were analyzed.\nTheir MSI status was determined by next generation sequencing. Patients before\na cutoff date were split into an algorithm development set (n=4015, MSI-H 1.8%)\nand a paired validation set (n=173, MSI-H 19.7%) that consisted of two serial\nsections from each sample, one stained and scanned internally and the other at\nan external site. Patients after the cutoff date formed the temporal validation\nset (n=1350, MSI-H 2.3%). Attention-based multiple instance learning models\nwere trained to predict MSI-H from H&E WSIs. The MSI-H predictor achieved area\nunder the receiver operating characteristic curve values of 0.78 (95% CI\n[0.69-0.86]), 0.72 (95% CI [0.63-0.81]), and 0.72 (95% CI [0.62-0.82]) on the\ninternally prepared, externally prepared, and temporal validation sets,\nrespectively. While MSI-H status is significantly correlated with Gleason\nscore, the model remained predictive within each Gleason score subgroup. 
In\nsummary, we developed and validated an AI-based MSI-H diagnostic model on a\nlarge real-world cohort of routine H&E slides, which effectively generalized to\nexternally stained and scanned samples and a temporally independent validation\ncohort. This algorithm has the potential to direct prostate cancer patients\ntoward immunotherapy and to identify MSI-H cases secondary to Lynch syndrome.\n","authors":["Qiyuan Hu","Abbas A. Rizvi","Geoffery Schau","Kshitij Ingale","Yoni Muller","Rachel Baits","Sebastian Pretzer","Aïcha BenTaieb","Abigail Gordhamer","Roberto Nussenzveig","Adam Cole","Matthew O. Leavitt","Rohan P. Joshi","Nike Beaubier","Martin C. Stumpe","Kunal Nagpal"],"pdf_url":"https://arxiv.org/pdf/2310.08743v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.11744v2","updated":"2023-10-12T21:20:16Z","published":"2022-11-21T18:59:33Z","title":"Visual Dexterity: In-hand Dexterous Manipulation from Depth","summary":" In-hand object reorientation is necessary for performing many dexterous\nmanipulation tasks, such as tool use in less structured environments that\nremain beyond the reach of current robots. Prior works built reorientation\nsystems assuming one or many of the following: reorienting only specific\nobjects with simple shapes, limited range of reorientation, slow or quasistatic\nmanipulation, simulation-only results, the need for specialized and costly\nsensor suites, and other constraints which make the system infeasible for\nreal-world deployment. We present a general object reorientation controller\nthat does not make these assumptions. It uses readings from a single commodity\ndepth camera to dynamically reorient complex and new object shapes by any\nrotation in real-time, with the median reorientation time being close to seven\nseconds. The controller is trained using reinforcement learning in simulation\nand evaluated in the real world on new object shapes not used for training,\nincluding the most challenging scenario of reorienting objects held in the air\nby a downward-facing hand that must counteract gravity during reorientation.\nOur hardware platform only uses open-source components that cost less than five\nthousand dollars. While we demonstrate the ability to overcome assumptions in\nprior work, there is ample scope for improving absolute performance. For\ninstance, the challenging duck-shaped object not used for training was dropped\nin 56% of the trials. When it was not dropped, our controller reoriented the\nobject within 0.4 radians (i.e., 23 degrees) 75% of the time. Videos are\navailable at: https://taochenshh.github.io/projects/visual-dexterity.\n","authors":["Tao Chen","Megha Tippur","Siyang Wu","Vikash Kumar","Edward Adelson","Pulkit Agrawal"],"pdf_url":"https://arxiv.org/pdf/2211.11744v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.11359v6","updated":"2023-10-12T20:31:52Z","published":"2022-09-23T01:09:06Z","title":"CUTS: A Framework for Multigranular Unsupervised Medical Image\n Segmentation","summary":" Segmenting medical images is critical to facilitating both patient diagnoses\nand quantitative research. A major limiting factor is the lack of labeled data,\nas obtaining expert annotations for each new set of imaging data or task can be\nexpensive, labor intensive, and inconsistent among annotators. 
To address this,\nwe present CUTS (Contrastive and Unsupervised Training for multi-granular\nmedical image Segmentation), a fully unsupervised deep learning framework for\nmedical image segmentation to better utilize the vast majority of imaging data\nthat are not labeled or annotated. CUTS works by leveraging a novel two-stage\napproach. First, it produces an image-specific embedding map via intra-image\ncontrastive loss and a local patch reconstruction objective. Second, these\nembeddings are partitioned at dynamic levels of granularity that correspond to\nthe data topology. Ultimately, CUTS yields a series of coarse-to-fine-grained\nsegmentations that highlight image features at various scales. We apply CUTS to\nretinal fundus images and two types of brain MRI images in order to delineate\nstructures and patterns at different scales, providing distinct information\nrelevant for clinicians. When evaluated against predefined anatomical masks at\na given granularity, CUTS demonstrates improvements ranging from 10% to 200% on\ndice coefficient and Hausdorff distance compared to existing unsupervised\nmethods. Further, CUTS shows performance on par with the latest Segment\nAnything Model which was pre-trained in a supervised fashion on 11 million\nimages and 1.1 billion masks. In summary, with CUTS we demonstrate that medical\nimage segmentation can be effectively solved without relying on large, labeled\ndatasets or vast computational resources.\n","authors":["Chen Liu","Matthew Amodio","Liangbo L. Shen","Feng Gao","Arman Avesta","Sanjay Aneja","Jay C. Wang","Lucian V. Del Priore","Smita Krishnaswamy"],"pdf_url":"https://arxiv.org/pdf/2209.11359v6.pdf","comment":"Additional experiments and updated figures"},{"id":"http://arxiv.org/abs/2310.08705v1","updated":"2023-10-12T20:31:20Z","published":"2023-10-12T20:31:20Z","title":"A Benchmarking Protocol for SAR Colorization: From Regression to Deep\n Learning Approaches","summary":" Synthetic aperture radar (SAR) images are widely used in remote sensing.\nInterpreting SAR images can be challenging due to their intrinsic speckle noise\nand grayscale nature. To address this issue, SAR colorization has emerged as a\nresearch direction to colorize gray scale SAR images while preserving the\noriginal spatial information and radiometric information. However, this\nresearch field is still in its early stages, and many limitations can be\nhighlighted. In this paper, we propose a full research line for supervised\nlearning-based approaches to SAR colorization. Our approach includes a protocol\nfor generating synthetic color SAR images, several baselines, and an effective\nmethod based on the conditional generative adversarial network (cGAN) for SAR\ncolorization. We also propose numerical assessment metrics for the problem at\nhand. To our knowledge, this is the first attempt to propose a research line\nfor SAR colorization that includes a protocol, a benchmark, and a complete\nperformance evaluation. Our extensive tests demonstrate the effectiveness of\nour proposed cGAN-based network for SAR colorization. 
The code will be made\npublicly available.\n","authors":["Kangqing Shen","Gemine Vivone","Xiaoyuan Yang","Simone Lolli","Michael Schmitt"],"pdf_url":"https://arxiv.org/pdf/2310.08705v1.pdf","comment":"16 pages, 16 figures, 6 tables"},{"id":"http://arxiv.org/abs/2304.08960v2","updated":"2023-10-12T20:08:52Z","published":"2023-04-18T12:51:18Z","title":"Generative modeling of living cells with SO(3)-equivariant implicit\n neural representations","summary":" Data-driven cell tracking and segmentation methods in biomedical imaging\nrequire diverse and information-rich training data. In cases where the number\nof training samples is limited, synthetic computer-generated data sets can be\nused to improve these methods. This requires the synthesis of cell shapes as\nwell as corresponding microscopy images using generative models. To synthesize\nrealistic living cell shapes, the shape representation used by the generative\nmodel should be able to accurately represent fine details and changes in\ntopology, which are common in cells. These requirements are not met by 3D voxel\nmasks, which are restricted in resolution, and polygon meshes, which do not\neasily model processes like cell growth and mitosis. In this work, we propose\nto represent living cell shapes as level sets of signed distance functions\n(SDFs) which are estimated by neural networks. We optimize a fully-connected\nneural network to provide an implicit representation of the SDF value at any\npoint in a 3D+time domain, conditioned on a learned latent code that is\ndisentangled from the rotation of the cell shape. We demonstrate the\neffectiveness of this approach on cells that exhibit rapid deformations\n(Platynereis dumerilii), cells that grow and divide (C. elegans), and cells\nthat have growing and branching filopodial protrusions (A549 human lung\ncarcinoma cells). A quantitative evaluation using shape features and Dice\nsimilarity coefficients of real and synthetic cell shapes shows that our model\ncan generate topologically plausible complex cell shapes in 3D+time with high\nsimilarity to real living cell shapes. Finally, we show how microscopy images\nof living cells that correspond to our generated cell shapes can be synthesized\nusing an image-to-image model.\n","authors":["David Wiesner","Julian Suk","Sven Dummer","Tereza Nečasová","Vladimír Ulman","David Svoboda","Jelmer M. Wolterink"],"pdf_url":"https://arxiv.org/pdf/2304.08960v2.pdf","comment":"Medical Image Analysis (MedIA) 2023 (Accepted)"},{"id":"http://arxiv.org/abs/2310.08681v1","updated":"2023-10-12T19:33:53Z","published":"2023-10-12T19:33:53Z","title":"Fed-Safe: Securing Federated Learning in Healthcare Against Adversarial\n Attacks","summary":" This paper explores the security aspects of federated learning applications\nin medical image analysis. Current robustness-oriented methods like adversarial\ntraining, secure aggregation, and homomorphic encryption often risk privacy\ncompromises. The central aim is to defend the network against potential privacy\nbreaches while maintaining model robustness against adversarial manipulations.\nWe show that incorporating distributed noise, grounded in the privacy\nguarantees in federated settings, enables the development of a adversarially\nrobust model that also meets federated privacy standards. 
We conducted\ncomprehensive evaluations across diverse attack scenarios, parameters, and use\ncases in cancer imaging, concentrating on pathology, meningioma, and glioma.\nThe results reveal that the incorporation of distributed noise allows for the\nattainment of security levels comparable to those of conventional adversarial\ntraining while requiring fewer retraining samples to establish a robust model.\n","authors":["Erfan Darzi","Nanna M. Sijtsema","P. M. A van Ooijen"],"pdf_url":"https://arxiv.org/pdf/2310.08681v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.14791v3","updated":"2023-10-12T19:10:57Z","published":"2022-10-26T15:38:28Z","title":"ViNL: Visual Navigation and Locomotion Over Obstacles","summary":" We present Visual Navigation and Locomotion over obstacles (ViNL), which\nenables a quadrupedal robot to navigate unseen apartments while stepping over\nsmall obstacles that lie in its path (e.g., shoes, toys, cables), similar to\nhow humans and pets lift their feet over objects as they walk. ViNL consists\nof: (1) a visual navigation policy that outputs linear and angular velocity\ncommands that guides the robot to a goal coordinate in unfamiliar indoor\nenvironments; and (2) a visual locomotion policy that controls the robot's\njoints to avoid stepping on obstacles while following provided velocity\ncommands. Both the policies are entirely \"model-free\", i.e. sensors-to-actions\nneural networks trained end-to-end. The two are trained independently in two\nentirely different simulators and then seamlessly co-deployed by feeding the\nvelocity commands from the navigator to the locomotor, entirely \"zero-shot\"\n(without any co-training). While prior works have developed learning methods\nfor visual navigation or visual locomotion, to the best of our knowledge, this\nis the first fully learned approach that leverages vision to accomplish both\n(1) intelligent navigation in new environments, and (2) intelligent visual\nlocomotion that aims to traverse cluttered environments without disrupting\nobstacles. On the task of navigation to distant goals in unknown environments,\nViNL using just egocentric vision significantly outperforms prior work on\nrobust locomotion using privileged terrain maps (+32.8% success and -4.42\ncollisions per meter). Additionally, we ablate our locomotion policy to show\nthat each aspect of our approach helps reduce obstacle collisions. Videos and\ncode at http://www.joannetruong.com/projects/vinl.html\n","authors":["Simar Kareer","Naoki Yokoyama","Dhruv Batra","Sehoon Ha","Joanne Truong"],"pdf_url":"https://arxiv.org/pdf/2210.14791v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08671v1","updated":"2023-10-12T19:08:03Z","published":"2023-10-12T19:08:03Z","title":"SSG2: A new modelling paradigm for semantic segmentation","summary":" State-of-the-art models in semantic segmentation primarily operate on single,\nstatic images, generating corresponding segmentation masks. This one-shot\napproach leaves little room for error correction, as the models lack the\ncapability to integrate multiple observations for enhanced accuracy. Inspired\nby work on semantic change detection, we address this limitation by introducing\na methodology that leverages a sequence of observables generated for each\nstatic input image. By adding this \"temporal\" dimension, we exploit strong\nsignal correlations between successive observations in the sequence to reduce\nerror rates. 
Our framework, dubbed SSG2 (Semantic Segmentation Generation 2),\nemploys a dual-encoder, single-decoder base network augmented with a sequence\nmodel. The base model learns to predict the set intersection, union, and\ndifference of labels from dual-input images. Given a fixed target input image\nand a set of support images, the sequence model builds the predicted mask of\nthe target by synthesizing the partial views from each sequence step and\nfiltering out noise. We evaluate SSG2 across three diverse datasets:\nUrbanMonitor, featuring orthoimage tiles from Darwin, Australia with five\nspectral bands and 0.2m spatial resolution; ISPRS Potsdam, which includes true\northophoto images with multiple spectral bands and a 5cm ground sampling\ndistance; and ISIC2018, a medical dataset focused on skin lesion segmentation,\nparticularly melanoma. The SSG2 model demonstrates rapid convergence within the\nfirst few tens of epochs and significantly outperforms UNet-like baseline\nmodels with the same number of gradient updates. However, the addition of the\ntemporal dimension results in an increased memory footprint. While this could\nbe a limitation, it is offset by the advent of higher-memory GPUs and coding\noptimizations.\n","authors":["Foivos I. Diakogiannis","Suzanne Furby","Peter Caccetta","Xiaoliang Wu","Rodrigo Ibata","Ondrej Hlinka","John Taylor"],"pdf_url":"https://arxiv.org/pdf/2310.08671v1.pdf","comment":"19 pages, Under review"},{"id":"http://arxiv.org/abs/2310.08669v1","updated":"2023-10-12T19:01:06Z","published":"2023-10-12T19:01:06Z","title":"Multimodal Large Language Model for Visual Navigation","summary":" Recent efforts to enable visual navigation using large language models have\nmainly focused on developing complex prompt systems. These systems incorporate\ninstructions, observations, and history into massive text prompts, which are\nthen combined with pre-trained large language models to facilitate visual\nnavigation. In contrast, our approach aims to fine-tune large language models\nfor visual navigation without extensive prompt engineering. Our design involves\na simple text prompt, current observations, and a history collector model that\ngathers information from previous observations as input. For output, our design\nprovides a probability distribution of possible actions that the agent can take\nduring navigation. We train our model using human demonstrations and collision\nsignals from the Habitat-Matterport 3D Dataset (HM3D). Experimental results\ndemonstrate that our method outperforms state-of-the-art behavior cloning\nmethods and effectively reduces collision rates.\n","authors":["Yao-Hung Hubert Tsai","Vansh Dhar","Jialu Li","Bowen Zhang","Jian Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08669v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.07891v2","updated":"2023-10-12T18:46:09Z","published":"2023-09-14T17:42:08Z","title":"HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a\n Single RGB Image","summary":" This paper presents a method to learn hand-object interaction prior for\nreconstructing a 3D hand-object scene from a single RGB image. The inference as\nwell as training-data generation for 3D hand-object scene reconstruction is\nchallenging due to the depth ambiguity of a single image and occlusions by the\nhand and object. We turn this challenge into an opportunity by utilizing the\nhand shape to constrain the possible relative configuration of the hand and\nobject geometry. 
We design a generalizable implicit function, HandNeRF, that\nexplicitly encodes the correlation of the 3D hand shape features and 2D object\nfeatures to predict the hand and object scene geometry. With experiments on\nreal-world datasets, we show that HandNeRF is able to reconstruct hand-object\nscenes of novel grasp configurations more accurately than comparable methods.\nMoreover, we demonstrate that object reconstruction from HandNeRF ensures more\naccurate execution of downstream tasks, such as grasping and motion planning\nfor robotic hand-over and manipulation. The code will be released here:\nhttps://github.com/SamsungLabs/HandNeRF\n","authors":["Hongsuk Choi","Nikhil Chavan-Dafle","Jiacheng Yuan","Volkan Isler","Hyunsoo Park"],"pdf_url":"https://arxiv.org/pdf/2309.07891v2.pdf","comment":"12 pages including the supplementary material, 8 tables, 12 figures"},{"id":"http://arxiv.org/abs/2210.05633v3","updated":"2023-10-12T18:40:37Z","published":"2022-10-11T17:25:51Z","title":"Habitat-Matterport 3D Semantics Dataset","summary":" We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is\nthe largest dataset of 3D real-world spaces with densely annotated semantics\nthat is currently available to the academic community. It consists of 142,646\nobject instance annotations across 216 3D spaces and 3,100 rooms within those\nspaces. The scale, quality, and diversity of object annotations far exceed\nthose of prior datasets. A key difference setting apart HM3DSEM from other\ndatasets is the use of texture information to annotate pixel-accurate object\nboundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the Object\nGoal Navigation task using different methods. Policies trained using HM3DSEM\noutperform those trained on prior datasets. Introduction of HM3DSEM in\nthe Habitat ObjectNav Challenge led to an increase in participation from 400\nsubmissions in 2021 to 1022 submissions in 2022.\n","authors":["Karmesh Yadav","Ram Ramrakhya","Santhosh Kumar Ramakrishnan","Theo Gervet","John Turner","Aaron Gokaslan","Noah Maestre","Angel Xuan Chang","Dhruv Batra","Manolis Savva","Alexander William Clegg","Devendra Singh Chaplot"],"pdf_url":"https://arxiv.org/pdf/2210.05633v3.pdf","comment":"15 Pages, 11 Figures, 6 Tables"},{"id":"http://arxiv.org/abs/2310.08654v1","updated":"2023-10-12T18:26:48Z","published":"2023-10-12T18:26:48Z","title":"Histogram- and Diffusion-Based Medical Out-of-Distribution Detection","summary":" Out-of-distribution (OOD) detection is crucial for the safety and reliability\nof artificial intelligence algorithms, especially in the medical domain. In the\ncontext of the Medical OOD (MOOD) detection challenge 2023, we propose a\npipeline that combines a histogram-based method and a diffusion-based method.\nThe histogram-based method is designed to accurately detect homogeneous\nanomalies in the toy examples of the challenge, such as blobs with constant\nintensity values. The diffusion-based method is based on one of the latest\nmethods for unsupervised anomaly detection, called DDPM-OOD. We explore this\nmethod and propose extensive post-processing steps for pixel-level and\nsample-level anomaly detection on brain MRI and abdominal CT data provided by\nthe challenge. Our results show that the proposed DDPM method is sensitive to\nblur and bias field samples, but faces challenges with anatomical deformation,\nblack slice, and swapped patches. 
These findings suggest that further research\nis needed to improve the performance of DDPM for OOD detection in medical\nimages.\n","authors":["Evi M. C. Huijben","Sina Amirrajab","Josien P. W. Pluim"],"pdf_url":"https://arxiv.org/pdf/2310.08654v1.pdf","comment":"9 pages, 5 figures, submission to Medical Out-of-Distribution (MOOD)\n challenge at MICCAI 2023"},{"id":"http://arxiv.org/abs/2310.08645v1","updated":"2023-10-12T18:10:36Z","published":"2023-10-12T18:10:36Z","title":"Defect Analysis of 3D Printed Cylinder Object Using Transfer Learning\n Approaches","summary":" Additive manufacturing (AM) is gaining attention across various industries\nlike healthcare, aerospace, and automotive. However, identifying defects early\nin the AM process can reduce production costs and improve productivity - a key\nchallenge. This study explored the effectiveness of machine learning (ML)\napproaches, specifically transfer learning (TL) models, for defect detection in\n3D-printed cylinders. Images of cylinders were analyzed using models including\nVGG16, VGG19, ResNet50, ResNet101, InceptionResNetV2, and MobileNetV2.\nPerformance was compared across two datasets using accuracy, precision, recall,\nand F1-score metrics. In the first study, VGG16, InceptionResNetV2, and\nMobileNetV2 achieved perfect scores. In contrast, ResNet50 had the lowest\nperformance, with an average F1-score of 0.32. Similarly, in the second study,\nMobileNetV2 correctly classified all instances, while ResNet50 struggled with\nmore false positives and fewer true positives, resulting in an F1-score of\n0.75. Overall, the findings suggest certain TL models like MobileNetV2 can\ndeliver high accuracy for AM defect classification, although performance varies\nacross algorithms. The results provide insights into model optimization and\nintegration needs for reliable automated defect analysis during 3D printing. By\nidentifying the top-performing TL techniques, this study aims to enhance AM\nproduct quality through robust image-based monitoring and inspection.\n","authors":["Md Manjurul Ahsan","Shivakumar Raman","Zahed Siddique"],"pdf_url":"https://arxiv.org/pdf/2310.08645v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08785v1","updated":"2023-10-12T15:43:12Z","published":"2023-10-12T15:43:12Z","title":"DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided\n Image Editing","summary":" Text-guided image editing faces significant challenges to training and\ninference flexibility. Much literature collects large amounts of annotated\nimage-text pairs to train text-conditioned generative models from scratch,\nwhich is expensive and not efficient. After that, some approaches that leverage\npre-trained vision-language models are put forward to avoid data collection,\nbut they are also limited by either per text-prompt optimization or\ninference-time hyper-parameters tuning. To address these issues, we investigate\nand identify a specific space, referred to as CLIP DeltaSpace, where the CLIP\nvisual feature difference of two images is semantically aligned with the CLIP\ntextual feature difference of their corresponding text descriptions. Based on\nDeltaSpace, we propose a novel framework called DeltaEdit, which maps the CLIP\nvisual feature differences to the latent space directions of a generative model\nduring the training phase, and predicts the latent space directions from the\nCLIP textual feature differences during the inference phase. 
And this design\nendows DeltaEdit with two advantages: (1) text-free training; (2)\ngeneralization to various text prompts for zero-shot inference. Extensive\nexperiments validate the effectiveness and versatility of DeltaEdit with\ndifferent generative models, including both the GAN model and the diffusion\nmodel, in achieving flexible text-guided image editing. Code is available at\nhttps://github.com/Yueming6568/DeltaEdit.\n","authors":["Yueming Lyu","Kang Zhao","Bo Peng","Yue Jiang","Yingya Zhang","Jing Dong"],"pdf_url":"https://arxiv.org/pdf/2310.08785v1.pdf","comment":"17 pages. arXiv admin note: text overlap with arXiv:2303.06285"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.08319v1","updated":"2023-10-12T13:32:35Z","published":"2023-10-12T13:32:35Z","title":"Fine-Tuning LLaMA for Multi-Stage Text Retrieval","summary":" The effectiveness of multi-stage text retrieval has been solidly demonstrated\nsince before the era of pre-trained language models. However, most existing\nstudies utilize models that predate recent advances in large language models\n(LLMs). This study seeks to explore potential improvements that\nstate-of-the-art LLMs can bring. We conduct a comprehensive study, fine-tuning\nthe latest LLaMA model both as a dense retriever (RepLLaMA) and as a pointwise\nreranker (RankLLaMA) for both passage retrieval and document retrieval using\nthe MS MARCO datasets. Our findings demonstrate that the effectiveness of large\nlanguage models indeed surpasses that of smaller models. Additionally, since\nLLMs can inherently handle longer contexts, they can represent entire documents\nholistically, obviating the need for traditional segmenting and pooling\nstrategies. Furthermore, evaluations on BEIR demonstrate that our\nRepLLaMA-RankLLaMA pipeline exhibits strong zero-shot effectiveness. Model\ncheckpoints from this study are available on HuggingFace.\n","authors":["Xueguang Ma","Liang Wang","Nan Yang","Furu Wei","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2310.08319v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08083v1","updated":"2023-10-12T07:14:22Z","published":"2023-10-12T07:14:22Z","title":"On Using GUI Interaction Data to Improve Text Retrieval-based Bug\n Localization","summary":" One of the most important tasks related to managing bug reports is localizing\nthe fault so that a fix can be applied. As such, prior work has aimed to\nautomate this task of bug localization by formulating it as an information\nretrieval problem, where potentially buggy files are retrieved and ranked\naccording to their textual similarity with a given bug report. However, there\nis often a notable semantic gap between the information contained in bug\nreports and identifiers or natural language contained within source code files.\nFor user-facing software, there is currently a key source of information that\ncould aid in bug localization, but has not been thoroughly investigated -\ninformation from the GUI.\n We investigate the hypothesis that, for end user-facing applications,\nconnecting information in a bug report with information from the GUI, and using\nthis to aid in retrieving potentially buggy files, can improve upon existing\ntechniques for bug localization. 
To examine this phenomenon, we conduct a\ncomprehensive empirical study that augments four baseline techniques for bug\nlocalization with GUI interaction information from a reproduction scenario to\n(i) filter out potentially irrelevant files, (ii) boost potentially relevant\nfiles, and (iii) reformulate text-retrieval queries. To carry out our study, we\nsource the current largest dataset of fully-localized and reproducible real\nbugs for Android apps, with corresponding bug reports, consisting of 80 bug\nreports from 39 popular open-source apps. Our results illustrate that\naugmenting traditional techniques with GUI information leads to a marked\nincrease in effectiveness across multiple metrics, including a relative\nincrease in Hits@10 of 13-18%. Additionally, through further analysis, we find\nthat our studied augmentations largely complement existing techniques.\n","authors":["Junayed Mahmud","Nadeeshan De Silva","Safwat Ali Khan","Seyed Hooman Mostafavi","SM Hasan Mansur","Oscar Chaparro","Andrian Marcus","Kevin Moran"],"pdf_url":"https://arxiv.org/pdf/2310.08083v1.pdf","comment":"13 pages, to appear in the Proceedings of the 46th International\n Conference on Software Engineering (ICSE'24)"},{"id":"http://arxiv.org/abs/2310.08069v1","updated":"2023-10-12T06:32:42Z","published":"2023-10-12T06:32:42Z","title":"Rethinking Negative Pairs in Code Search","summary":" Recently, contrastive learning has become a key component in fine-tuning code\nsearch models for software development efficiency and effectiveness. It pulls\ntogether positive code snippets while pushing negative samples away given\nsearch queries. Among contrastive learning, InfoNCE is the most widely used\nloss function due to its better performance. However, the following problems in\nnegative samples of InfoNCE may deteriorate its representation learning: 1) The\nexistence of false negative samples in large code corpora due to duplications.\n2). The failure to explicitly differentiate between the potential relevance of\nnegative samples. As an example, a bubble sorting algorithm example is less\n``negative'' than a file saving function for the quick sorting algorithm query.\nIn this paper, we tackle the above problems by proposing a simple yet effective\nSoft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss\nfunction, we apply three methods to estimate the weights of negative pairs and\nshow that the vanilla InfoNCE loss is a special case of Soft-InfoNCE.\nTheoretically, we analyze the effects of Soft-InfoNCE on controlling the\ndistribution of learnt code representations and on deducing a more precise\nmutual information estimation. We furthermore discuss the superiority of\nproposed loss functions with other design alternatives. Extensive experiments\ndemonstrate the effectiveness of Soft-InfoNCE and weights estimation methods\nunder state-of-the-art code search models on a large-scale public dataset\nconsisting of six programming languages. 
Source code is available at\n\\url{https://github.com/Alex-HaochenLi/Soft-InfoNCE}.\n","authors":["Haochen Li","Xin Zhou","Luu Anh Tuan","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2310.08069v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08039v1","updated":"2023-10-12T05:14:42Z","published":"2023-10-12T05:14:42Z","title":"Rethinking Large-scale Pre-ranking System: Entire-chain Cross-domain\n Models","summary":" Industrial systems such as recommender systems and online advertising, have\nbeen widely equipped with multi-stage architectures, which are divided into\nseveral cascaded modules, including matching, pre-ranking, ranking and\nre-ranking. As a critical bridge between matching and ranking, existing\npre-ranking approaches mainly endure sample selection bias (SSB) problem owing\nto ignoring the entire-chain data dependence, resulting in sub-optimal\nperformances. In this paper, we rethink pre-ranking system from the perspective\nof the entire sample space, and propose Entire-chain Cross-domain Models (ECM),\nwhich leverage samples from the whole cascaded stages to effectively alleviate\nSSB problem. Besides, we design a fine-grained neural structure named ECMM to\nfurther improve the pre-ranking accuracy. Specifically, we propose a\ncross-domain multi-tower neural network to comprehensively predict for each\nstage result, and introduce the sub-networking routing strategy with $L0$\nregularization to reduce computational costs. Evaluations on real-world\nlarge-scale traffic logs demonstrate that our pre-ranking models outperform\nSOTA methods while time consumption is maintained within an acceptable level,\nwhich achieves better trade-off between efficiency and effectiveness.\n","authors":["Jinbo Song","Ruoran Huang","Xinyang Wang","Wei Huang","Qian Yu","Mingming Chen","Yafei Yao","Chaosheng Fan","Changping Peng","Zhangang Lin","Jinghe Hu","Jingping Shao"],"pdf_url":"https://arxiv.org/pdf/2310.08039v1.pdf","comment":"5 pages, 2 figures"},{"id":"http://arxiv.org/abs/2310.08038v1","updated":"2023-10-12T05:09:27Z","published":"2023-10-12T05:09:27Z","title":"Continual Learning via Manifold Expansion Replay","summary":" In continual learning, the learner learns multiple tasks in sequence, with\ndata being acquired only once for each task. Catastrophic forgetting is a major\nchallenge to continual learning. To reduce forgetting, some existing\nrehearsal-based methods use episodic memory to replay samples of previous\ntasks. However, in the process of knowledge integration when learning a new\ntask, this strategy also suffers from catastrophic forgetting due to an\nimbalance between old and new knowledge. To address this problem, we propose a\nnovel replay strategy called Manifold Expansion Replay (MaER). We argue that\nexpanding the implicit manifold of the knowledge representation in the episodic\nmemory helps to improve the robustness and expressiveness of the model. To this\nend, we propose a greedy strategy to keep increasing the diameter of the\nimplicit manifold represented by the knowledge in the buffer during memory\nmanagement. In addition, we introduce Wasserstein distance instead of cross\nentropy as distillation loss to preserve previous knowledge. 
With extensive\nexperimental validation on MNIST, CIFAR10, CIFAR100, and TinyImageNet, we show\nthat the proposed method significantly improves the accuracy in continual\nlearning setup, outperforming the state of the arts.\n","authors":["Zihao Xu","Xuan Tang","Yufei Shi","Jianfeng Zhang","Jian Yang","Mingsong Chen","Xian Wei"],"pdf_url":"https://arxiv.org/pdf/2310.08038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07990v1","updated":"2023-10-12T02:34:56Z","published":"2023-10-12T02:34:56Z","title":"Multi-View Variational Autoencoder for Missing Value Imputation in\n Untargeted Metabolomics","summary":" Background: Missing data is a common challenge in mass spectrometry-based\nmetabolomics, which can lead to biased and incomplete analyses. The integration\nof whole-genome sequencing (WGS) data with metabolomics data has emerged as a\npromising approach to enhance the accuracy of data imputation in metabolomics\nstudies. Method: In this study, we propose a novel method that leverages the\ninformation from WGS data and reference metabolites to impute unknown\nmetabolites. Our approach utilizes a multi-view variational autoencoder to\njointly model the burden score, polygenetic risk score (PGS), and linkage\ndisequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature\nextraction and missing metabolomics data imputation. By learning the latent\nrepresentations of both omics data, our method can effectively impute missing\nmetabolomics values based on genomic information. Results: We evaluate the\nperformance of our method on empirical metabolomics datasets with missing\nvalues and demonstrate its superiority compared to conventional imputation\ntechniques. Using 35 template metabolites derived burden scores, PGS and\nLD-pruned SNPs, the proposed methods achieved r2-scores > 0.01 for 71.55% of\nmetabolites. Conclusion: The integration of WGS data in metabolomics imputation\nnot only improves data completeness but also enhances downstream analyses,\npaving the way for more comprehensive and accurate investigations of metabolic\npathways and disease associations. Our findings offer valuable insights into\nthe potential benefits of utilizing WGS data for metabolomics data imputation\nand underscore the importance of leveraging multi-modal data integration in\nprecision medicine research.\n","authors":["Chen Zhao","Kuan-Jui Su","Chong Wu","Xuewei Cao","Qiuying Sha","Wu Li","Zhe Luo","Tian Qin","Chuan Qiu","Lan Juan Zhao","Anqi Liu","Lindong Jiang","Xiao Zhang","Hui Shen","Weihua Zhou","Hong-Wen Deng"],"pdf_url":"https://arxiv.org/pdf/2310.07990v1.pdf","comment":"19 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.08759v1","updated":"2023-10-12T22:56:53Z","published":"2023-10-12T22:56:53Z","title":"Question Answering for Electronic Health Records: A Scoping Review of\n datasets and models","summary":" Question Answering (QA) systems on patient-related data can assist both\nclinicians and patients. They can, for example, assist clinicians in\ndecision-making and enable patients to have a better understanding of their\nmedical history. Significant amounts of patient data are stored in Electronic\nHealth Records (EHRs), making EHR QA an important research area. In EHR QA, the\nanswer is obtained from the medical record of the patient. Because of the\ndifferences in data format and modality, this differs greatly from other\nmedical QA tasks that employ medical websites or scientific papers to retrieve\nanswers, making it critical to research EHR question answering. 
This study\naimed to provide a methodological review of existing works on QA over EHRs. We\nsearched for articles from January 1st, 2005 to September 30th, 2023 in four\ndigital sources including Google Scholar, ACL Anthology, ACM Digital Library,\nand PubMed to collect relevant publications on EHR QA. 4111 papers were\nidentified for our study, and after screening based on our inclusion criteria,\nwe obtained a total of 47 papers for further study. Out of the 47 papers, 25\npapers were about EHR QA datasets, and 37 papers were about EHR QA models. It\nwas observed that QA on EHRs is relatively new and unexplored. Most of the\nworks are fairly recent. Also, it was observed that emrQA is by far the most\npopular EHR QA dataset, both in terms of citations and usage in other papers.\nFurthermore, we identified the different models used in EHR QA along with the\nevaluation metrics used for these models.\n","authors":["Jayetri Bardhan","Kirk Roberts","Daisy Zhe Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08759v1.pdf","comment":"5 tables, 6 figures"},{"id":"http://arxiv.org/abs/2306.15010v2","updated":"2023-10-12T06:45:38Z","published":"2023-06-26T18:49:09Z","title":"Efficient High-Resolution Template Matching with Vector Quantized\n Nearest Neighbour Fields","summary":" Template matching is a fundamental problem in computer vision with\napplications in fields including object detection, image registration, and\nobject tracking. Current methods rely on nearest-neighbour (NN) matching, where\nthe query feature space is converted to NN space by representing each query\npixel with its NN in the template. NN-based methods have been shown to perform\nbetter in occlusions, appearance changes, and non-rigid transformations;\nhowever, they scale poorly with high-resolution data and high feature\ndimensions. We present an NN-based method which efficiently reduces the NN\ncomputations and introduces filtering in the NN fields (NNFs). A vector\nquantization step is introduced before the NN calculation to represent the\ntemplate with $k$ features, and the filter response over the NNFs is used to\ncompare the template and query distributions over the features. We show that\nstate-of-the-art performance is achieved in low-resolution data, and our method\noutperforms previous methods at higher resolution.\n","authors":["Ankit Gupta","Ida-Maria Sintorn"],"pdf_url":"https://arxiv.org/pdf/2306.15010v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08613v1","updated":"2023-10-12T06:26:20Z","published":"2023-10-12T06:26:20Z","title":"Individual Variation Affects Outbreak Magnitude and Predictability in an\n Extended Multi-Pathogen SIR Model of Pigeons Vising Dairy Farms","summary":" Zoonotic disease transmission between animals and humans is a growing risk\nand the agricultural context acts as a likely point of transition, with\nindividual heterogeneity acting as an important contributor. Thus,\nunderstanding the dynamics of disease spread in the wildlife-livestock\ninterface is crucial for mitigating these risks of transmission. Specifically,\nthe interactions between pigeons and in-door cows at dairy farms can lead to\nsignificant disease transmission and economic losses for farmers; putting\nlivestock, adjacent human populations, and other wildlife species at risk. In\nthis paper, we propose a novel spatio-temporal multi-pathogen model with\ncontinuous spatial movement. 
The model expands on the\nSusceptible-Exposed-Infected-Recovered-Dead (SEIRD) framework and accounts for\nboth within-species and cross-species transmission of pathogens, as well as the\nexploration-exploitation movement dynamics of pigeons, which play a critical\nrole in the spread of infection agents. In addition to model formulation, we\nalso implement it as an agent-based simulation approach and use empirical field\ndata to investigate different biologically realistic scenarios, evaluating the\neffect of various parameters on the epidemic spread. Namely, in agreement with\ntheoretical expectations, the model predicts that the heterogeneity of the\npigeons' movement dynamics can drastically affect both the magnitude and\nstability of outbreaks. In addition, joint infection by multiple pathogens can\nhave an interactive effect unobservable in single-pathogen SIR models,\nreflecting a non-intuitive inhibition of the outbreak. Our findings highlight\nthe impact of heterogeneity in host behavior on their pathogens and allow\nrealistic predictions of outbreak dynamics in the multi-pathogen\nwildlife-livestock interface with consequences to zoonotic diseases in various\nsystems.\n","authors":["Teddy Lazebnik","Orr Spiegel"],"pdf_url":"https://arxiv.org/pdf/2310.08613v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2310.08588v1","updated":"2023-10-12T17:59:58Z","published":"2023-10-12T17:59:58Z","title":"Octopus: Embodied Vision-Language Programmer from Environmental Feedback","summary":" Large vision-language models (VLMs) have achieved substantial progress in\nmultimodal perception and reasoning. Furthermore, when seamlessly integrated\ninto an embodied agent, it signifies a crucial stride towards the creation of\nautonomous and context-aware systems capable of formulating plans and executing\ncommands with precision. In this paper, we introduce Octopus, a novel VLM\ndesigned to proficiently decipher an agent's vision and textual task objectives\nand to formulate intricate action sequences and generate executable code. Our\ndesign allows the agent to adeptly handle a wide spectrum of tasks, ranging\nfrom mundane daily chores in simulators to sophisticated interactions in\ncomplex video games. Octopus is trained by leveraging GPT-4 to control an\nexplorative agent to generate training data, i.e., action blueprints and the\ncorresponding executable code, within our experimental environment called\nOctoVerse. We also collect the feedback that allows the enhanced training\nscheme of Reinforcement Learning with Environmental Feedback (RLEF). Through a\nseries of experiments, we illuminate Octopus's functionality and present\ncompelling results, and the proposed RLEF turns out to refine the agent's\ndecision-making. 
By open-sourcing our model architecture, simulator, and\ndataset, we aspire to ignite further innovation and foster collaborative\napplications within the broader embodied AI community.\n","authors":["Jingkang Yang","Yuhao Dong","Shuai Liu","Bo Li","Ziyue Wang","Chencheng Jiang","Haoran Tan","Jiamu Kang","Yuanhan Zhang","Kaiyang Zhou","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08588v1.pdf","comment":"Project Page: https://choiszt.github.io/Octopus/, Codebase:\n https://github.com/dongyh20/Octopus"},{"id":"http://arxiv.org/abs/2310.08582v1","updated":"2023-10-12T17:59:50Z","published":"2023-10-12T17:59:50Z","title":"Tree-Planner: Efficient Close-loop Task Planning with Large Language\n Models","summary":" This paper studies close-loop task planning, which refers to the process of\ngenerating a sequence of skills (a plan) to accomplish a specific goal while\nadapting the plan based on real-time observations. Recently, prompting Large\nLanguage Models (LLMs) to generate actions iteratively has become a prevalent\nparadigm due to its superior performance and user-friendliness. However, this\nparadigm is plagued by two inefficiencies: high token consumption and redundant\nerror correction, both of which hinder its scalability for large-scale testing\nand applications. To address these issues, we propose Tree-Planner, which\nreframes task planning with LLMs into three distinct phases: plan sampling,\naction tree construction, and grounded deciding. Tree-Planner starts by using\nan LLM to sample a set of potential plans before execution, followed by the\naggregation of them to form an action tree. Finally, the LLM performs a\ntop-down decision-making process on the tree, taking into account real-time\nenvironmental information. Experiments show that Tree-Planner achieves\nstate-of-the-art performance while maintaining high efficiency. By decomposing\nLLM queries into a single plan-sampling call and multiple grounded-deciding\ncalls, a considerable part of the prompt are less likely to be repeatedly\nconsumed. As a result, token consumption is reduced by 92.2% compared to the\npreviously best-performing model. Additionally, by enabling backtracking on the\naction tree as needed, the correction process becomes more flexible, leading to\na 40.5% decrease in error corrections. Project page:\nhttps://tree-planner.github.io/\n","authors":["Mengkang Hu","Yao Mu","Xinmiao Yu","Mingyu Ding","Shiguang Wu","Wenqi Shao","Qiguang Chen","Bin Wang","Yu Qiao","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2310.08582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08577v1","updated":"2023-10-12T17:59:30Z","published":"2023-10-12T17:59:30Z","title":"Visual Data-Type Understanding does not emerge from Scaling\n Vision-Language Models","summary":" Recent advances in the development of vision-language models (VLMs) are\nyielding remarkable success in recognizing visual semantic content, including\nimpressive instances of compositional image understanding. Here, we introduce\nthe novel task of \\textit{Visual Data-Type Identification}, a basic perceptual\nskill with implications for data curation (e.g., noisy data-removal from large\ndatasets, domain-specific retrieval) and autonomous vision (e.g.,\ndistinguishing changing weather conditions from camera lens staining). We\ndevelop two datasets consisting of animal images altered across a diverse set\nof 27 visual \\textit{data-types}, spanning four broad categories. 
An extensive\nzero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a\nnuanced performance landscape. While VLMs are reasonably good at identifying\ncertain stylistic \\textit{data-types}, such as cartoons and sketches, they\nstruggle with simpler \\textit{data-types} arising from basic manipulations like\nimage rotations or additive noise. Our findings reveal that (i) model scaling\nalone yields marginal gains for contrastively-trained models like CLIP, and\n(ii) there is a pronounced drop in performance for the largest\nauto-regressively trained VLMs like OpenFlamingo. This finding points to a\nblind spot in current frontier VLMs: they excel in recognizing semantic content\nbut fail to acquire an understanding of visual \\textit{data-types} through\nscaling. By analyzing the pre-training distributions of these models and\nincorporating \\textit{data-type} information into the captions during\nfine-tuning, we achieve a significant enhancement in performance. By exploring\nthis previously uncharted task, we aim to set the stage for further advancing\nVLMs to equip them with visual data-type understanding. Code and datasets are\nreleased \\href{https://github.com/bethgelab/DataTypeIdentification}{here}.\n","authors":["Vishaal Udandarao","Max F. Burg","Samuel Albanie","Matthias Bethge"],"pdf_url":"https://arxiv.org/pdf/2310.08577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08576v1","updated":"2023-10-12T17:59:23Z","published":"2023-10-12T17:59:23Z","title":"Learning to Act from Actionless Videos through Dense Correspondences","summary":" In this work, we present an approach to construct a video-based robot policy\ncapable of reliably executing diverse tasks across different robots and\nenvironments from few video demonstrations without using any action\nannotations. Our method leverages images as a task-agnostic representation,\nencoding both the state and action information, and text as a general\nrepresentation for specifying robot goals. By synthesizing videos that\n``hallucinate'' robot executing actions and in combination with dense\ncorrespondences between frames, our approach can infer the closed-formed action\nto execute to an environment without the need of any explicit action labels.\nThis unique capability allows us to train the policy solely based on RGB videos\nand deploy learned policies to various robotic tasks. We demonstrate the\nefficacy of our approach in learning policies on table-top manipulation and\nnavigation tasks. Additionally, we contribute an open-source framework for\nefficient video modeling, enabling the training of high-fidelity policy models\nwith four GPUs within a single day.\n","authors":["Po-Chen Ko","Jiayuan Mao","Yilun Du","Shao-Hua Sun","Joshua B. Tenenbaum"],"pdf_url":"https://arxiv.org/pdf/2310.08576v1.pdf","comment":"Project page: https://flow-diffusion.github.io/"},{"id":"http://arxiv.org/abs/2209.07481v2","updated":"2023-10-12T17:59:22Z","published":"2022-09-15T17:22:04Z","title":"Quasi-Arithmetic Mixtures, Divergence Minimization, and Bregman\n Information","summary":" Markov Chain Monte Carlo methods for sampling from complex distributions and\nestimating normalization constants often simulate samples from a sequence of\nintermediate distributions along an annealing path, which bridges between a\ntractable initial distribution and a target density of interest. 
Prior work has\nconstructed annealing paths using quasi-arithmetic means, and interpreted the\nresulting intermediate densities as minimizing an expected divergence to the\nendpoints. We provide a comprehensive analysis of this 'centroid' property\nusing Bregman divergences under a monotonic embedding of the density function,\nthereby associating common divergences such as Amari's and Renyi's\n${\\alpha}$-divergences, ${(\\alpha,\\beta)}$-divergences, and the Jensen-Shannon\ndivergence with intermediate densities along an annealing path. Our analysis\nhighlights the interplay between parametric families, quasi-arithmetic means,\nand divergence functions using the rho-tau Bregman divergence framework of\nZhang 2004,2013.\n","authors":["Rob Brekelmans","Frank Nielsen"],"pdf_url":"https://arxiv.org/pdf/2209.07481v2.pdf","comment":"19 pages + appendix (rewritten + changed title in revision)"},{"id":"http://arxiv.org/abs/2310.08574v1","updated":"2023-10-12T17:57:57Z","published":"2023-10-12T17:57:57Z","title":"Jigsaw: Supporting Designers in Prototyping Multimodal Applications by\n Assembling AI Foundation Models","summary":" Recent advancements in AI foundation models have made it possible for them to\nbe utilized off-the-shelf for creative tasks, including ideating design\nconcepts or generating visual prototypes. However, integrating these models\ninto the creative process can be challenging as they often exist as standalone\napplications tailored to specific tasks. To address this challenge, we\nintroduce Jigsaw, a prototype system that employs puzzle pieces as metaphors to\nrepresent foundation models. Jigsaw allows designers to combine different\nfoundation model capabilities across various modalities by assembling\ncompatible puzzle pieces. To inform the design of Jigsaw, we interviewed ten\ndesigners and distilled design goals. In a user study, we showed that Jigsaw\nenhanced designers' understanding of available foundation model capabilities,\nprovided guidance on combining capabilities across different modalities and\ntasks, and served as a canvas to support design exploration, prototyping, and\ndocumentation.\n","authors":["David Chuan-En Lin","Nikolas Martelaro"],"pdf_url":"https://arxiv.org/pdf/2310.08574v1.pdf","comment":"Webpage: https://preview.jigsaw.to"},{"id":"http://arxiv.org/abs/2310.08571v1","updated":"2023-10-12T17:56:53Z","published":"2023-10-12T17:56:53Z","title":"Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders","summary":" Machine Learning as a Service (MLaaS) APIs provide ready-to-use and\nhigh-utility encoders that generate vector representations for given inputs.\nSince these encoders are very costly to train, they become lucrative targets\nfor model stealing attacks during which an adversary leverages query access to\nthe API to replicate the encoder locally at a fraction of the original training\ncosts. We propose Bucks for Buckets (B4B), the first active defense that\nprevents stealing while the attack is happening without degrading\nrepresentation quality for legitimate API users. Our defense relies on the\nobservation that the representations returned to adversaries who try to steal\nthe encoder's functionality cover a significantly larger fraction of the\nembedding space than representations of legitimate users who utilize the\nencoder to solve a particular downstream task. B4B leverages this to adaptively\nadjust the utility of the returned representations according to a user's\ncoverage of the embedding space. 
To prevent adaptive adversaries from eluding\nour defense by simply creating multiple user accounts (sybils), B4B also\nindividually transforms each user's representations. This prevents the\nadversary from directly aggregating representations over multiple accounts to\ncreate their stolen encoder copy. Our active defense opens a new path towards\nsecurely sharing and democratizing encoders over public APIs.\n","authors":["Jan Dubiński","Stanisław Pawlak","Franziska Boenisch","Tomasz Trzciński","Adam Dziedzic"],"pdf_url":"https://arxiv.org/pdf/2310.08571v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08566v1","updated":"2023-10-12T17:55:02Z","published":"2023-10-12T17:55:02Z","title":"Transformers as Decision Makers: Provable In-Context Reinforcement\n Learning via Supervised Pretraining","summary":" Large transformer models pretrained on offline reinforcement learning\ndatasets have demonstrated remarkable in-context reinforcement learning (ICRL)\ncapabilities, where they can make good decisions when prompted with interaction\ntrajectories from unseen environments. However, when and how transformers can\nbe trained to perform ICRL have not been theoretically well-understood. In\nparticular, it is unclear which reinforcement-learning algorithms transformers\ncan perform in context, and how distribution mismatch in offline training data\naffects the learned algorithms. This paper provides a theoretical framework\nthat analyzes supervised pretraining for ICRL. This includes two recently\nproposed training methods -- algorithm distillation and decision-pretrained\ntransformers. First, assuming model realizability, we prove the\nsupervised-pretrained transformer will imitate the conditional expectation of\nthe expert algorithm given the observed trajectory. The generalization error\nwill scale with model capacity and a distribution divergence factor between the\nexpert and offline algorithms. Second, we show transformers with ReLU attention\ncan efficiently approximate near-optimal online reinforcement learning\nalgorithms like LinUCB and Thompson sampling for stochastic linear bandits, and\nUCB-VI for tabular Markov decision processes. This provides the first\nquantitative analysis of the ICRL capabilities of transformers pretrained from\noffline trajectories.\n","authors":["Licong Lin","Yu Bai","Song Mei"],"pdf_url":"https://arxiv.org/pdf/2310.08566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08558v1","updated":"2023-10-12T17:50:09Z","published":"2023-10-12T17:50:09Z","title":"Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate\n Exploration Bias","summary":" It is desirable for policies to optimistically explore new states and\nbehaviors during online reinforcement learning (RL) or fine-tuning, especially\nwhen prior offline data does not provide enough state coverage. However,\nexploration bonuses can bias the learned policy, and our experiments find that\nnaive, yet standard use of such bonuses can fail to recover a performant\npolicy. Concurrently, pessimistic training in offline RL has enabled recovery\nof performant policies from static datasets. Can we leverage offline RL to\nrecover better policies from online interaction? We make a simple observation\nthat a policy can be trained from scratch on all interaction data with\npessimistic objectives, thereby decoupling the policies used for data\ncollection and for evaluation. 
Specifically, we propose offline retraining, a\npolicy extraction step at the end of online fine-tuning in our\nOffline-to-Online-to-Offline (OOO) framework for reinforcement learning (RL).\nAn optimistic (exploration) policy is used to interact with the environment,\nand a separate pessimistic (exploitation) policy is trained on all the observed\ndata for evaluation. Such decoupling can reduce any bias from online\ninteraction (intrinsic rewards, primacy bias) in the evaluation policy, and can\nallow more exploratory behaviors during online interaction which in turn can\ngenerate better data for exploitation. OOO is complementary to several\noffline-to-online RL and online RL methods, and improves their average\nperformance by 14% to 26% in our fine-tuning experiments, achieves\nstate-of-the-art performance on several environments in the D4RL benchmarks,\nand improves online RL performance by 165% on two OpenAI gym environments.\nFurther, OOO can enable fine-tuning from incomplete offline datasets where\nprior methods can fail to recover a performant policy. Implementation:\nhttps://github.com/MaxSobolMark/OOO\n","authors":["Max Sobol Mark","Archit Sharma","Fahim Tajwar","Rafael Rafailov","Sergey Levine","Chelsea Finn"],"pdf_url":"https://arxiv.org/pdf/2310.08558v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.04370v2","updated":"2023-10-12T17:48:22Z","published":"2023-09-08T15:02:46Z","title":"Seeing-Eye Quadruped Navigation with Force Responsive Locomotion Control","summary":" Seeing-eye robots are very useful tools for guiding visually impaired people,\npotentially producing a huge societal impact given the low availability and\nhigh cost of real guide dogs. Although a few seeing-eye robot systems have\nalready been demonstrated, none considered external tugs from humans, which\nfrequently occur in a real guide dog setting. In this paper, we simultaneously\ntrain a locomotion controller that is robust to external tugging forces via\nReinforcement Learning (RL), and an external force estimator via supervised\nlearning. The controller ensures stable walking, and the force estimator\nenables the robot to respond to the external forces from the human. These\nforces are used to guide the robot to the global goal, which is unknown to the\nrobot, while the robot guides the human around nearby obstacles via a local\nplanner. Experimental results in simulation and on hardware show that our\ncontroller is robust to external forces, and our seeing-eye system can\naccurately detect force direction. We demonstrate our full seeing-eye robot\nsystem on a real quadruped robot with a blindfolded human. The video can be\nseen at our project page: https://bu-air-lab.github.io/guide_dog/\n","authors":["David DeFazio","Eisuke Hirota","Shiqi Zhang"],"pdf_url":"https://arxiv.org/pdf/2309.04370v2.pdf","comment":"Accepted to CoRL 2023"},{"id":"http://arxiv.org/abs/2310.08549v1","updated":"2023-10-12T17:45:05Z","published":"2023-10-12T17:45:05Z","title":"Cross-Episodic Curriculum for Transformer Agents","summary":" We present a new algorithm, Cross-Episodic Curriculum (CEC), to boost the\nlearning efficiency and generalization of Transformer agents. Central to CEC is\nthe placement of cross-episodic experiences into a Transformer's context, which\nforms the basis of a curriculum. By sequentially structuring online learning\ntrials and mixed-quality demonstrations, CEC constructs curricula that\nencapsulate learning progression and proficiency increase across episodes. 
Such\nsynergy combined with the potent pattern recognition capabilities of\nTransformer models delivers a powerful cross-episodic attention mechanism. The\neffectiveness of CEC is demonstrated under two representative scenarios: one\ninvolving multi-task reinforcement learning with discrete control, such as in\nDeepMind Lab, where the curriculum captures the learning progression in both\nindividual and progressively complex settings; and the other involving\nimitation learning with mixed-quality data for continuous control, as seen in\nRoboMimic, where the curriculum captures the improvement in demonstrators'\nexpertise. In all instances, policies resulting from CEC exhibit superior\nperformance and strong generalization. Code is open-sourced at\nhttps://cec-agent.github.io/ to facilitate research on Transformer agent\nlearning.\n","authors":["Lucy Xiaoyang Shi","Yunfan Jiang","Jake Grigsby","Linxi \"Jim\" Fan","Yuke Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.08549v1.pdf","comment":"To appear in NeurIPS 2023; The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2310.08548v1","updated":"2023-10-12T17:44:59Z","published":"2023-10-12T17:44:59Z","title":"Stronger Coreset Bounds for Kernel Density Estimators via Chaining","summary":" We apply the discrepancy method and a chaining approach to give improved\nbounds on the coreset complexity of a wide class of kernel functions. Our\nresults give randomized polynomial time algorithms to produce coresets of size\n$O\\big(\\frac{\\sqrt{d}}{\\varepsilon}\\sqrt{\\log\\log \\frac{1}{\\varepsilon}}\\big)$\nfor the Gaussian and Laplacian kernels in the case that the data set is\nuniformly bounded, an improvement that was not possible with previous\ntechniques. We also obtain coresets of size\n$O\\big(\\frac{1}{\\varepsilon}\\sqrt{\\log\\log \\frac{1}{\\varepsilon}}\\big)$ for the\nLaplacian kernel for $d$ constant. Finally, we give the best known bounds of\n$O\\big(\\frac{\\sqrt{d}}{\\varepsilon}\\sqrt{\\log(2\\max\\{1,\\alpha\\})}\\big)$ on the\ncoreset complexity of the exponential, Hellinger, and JS Kernels, where\n$1/\\alpha$ is the bandwidth parameter of the kernel.\n","authors":["Rainie Bozzai","Thomas Rothvoss"],"pdf_url":"https://arxiv.org/pdf/2310.08548v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2310.08540v1","updated":"2023-10-12T17:32:09Z","published":"2023-10-12T17:32:09Z","title":"Do pretrained Transformers Really Learn In-context by Gradient Descent?","summary":" Is In-Context Learning (ICL) implicitly equivalent to Gradient Descent (GD)?\nSeveral recent works draw analogies between the dynamics of GD and the emergent\nbehavior of ICL in large language models. However, these works make assumptions\nfar from the realistic natural language setting in which language models are\ntrained. Such discrepancies between theory and practice, therefore, necessitate\nfurther investigation to validate their applicability.\n We start by highlighting the weaknesses in prior works that construct\nTransformer weights to simulate gradient descent. Their experiments with\ntraining Transformers on ICL objective, inconsistencies in the order\nsensitivity of ICL and GD, sparsity of the constructed weights, and sensitivity\nto parameter changes are some examples of a mismatch from the real-world\nsetting.\n Furthermore, we probe and compare the ICL vs. GD hypothesis in a natural\nsetting. We conduct comprehensive empirical analyses on language models\npretrained on natural data (LLaMa-7B). 
Our comparisons on various performance\nmetrics highlight the inconsistent behavior of ICL and GD as a function of\nvarious factors such as datasets, models, and number of demonstrations. We\nobserve that ICL and GD adapt the output distribution of language models\ndifferently. These results indicate that the equivalence between ICL and GD is\nan open hypothesis, requires nuanced considerations and calls for further\nstudies.\n","authors":["Lingfeng Shen","Aayush Mishra","Daniel Khashabi"],"pdf_url":"https://arxiv.org/pdf/2310.08540v1.pdf","comment":null},{"id":"http://arxiv.org/abs/1910.09143v5","updated":"2023-10-12T17:27:48Z","published":"2019-10-21T04:24:29Z","title":"Dynamic Subgoal-based Exploration via Bayesian Optimization","summary":" Reinforcement learning in sparse-reward navigation environments with\nexpensive and limited interactions is challenging and poses a need for\neffective exploration. Motivated by complex navigation tasks that require\nreal-world training (when cheap simulators are not available), we consider an\nagent that faces an unknown distribution of environments and must decide on an\nexploration strategy. It may leverage a series of training environments to\nimprove its policy before it is evaluated in a test environment drawn from the\nsame environment distribution. Most existing approaches focus on fixed\nexploration strategies, while the few that view exploration as a\nmeta-optimization problem tend to ignore the need for cost-efficient\nexploration. We propose a cost-aware Bayesian optimization approach that\nefficiently searches over a class of dynamic subgoal-based exploration\nstrategies. The algorithm adjusts a variety of levers -- the locations of the\nsubgoals, the length of each episode, and the number of replications per trial\n-- in order to overcome the challenges of sparse rewards, expensive\ninteractions, and noise. An experimental evaluation demonstrates that the new\napproach outperforms existing baselines across a number of problem domains. We\nalso provide a theoretical foundation and prove that the method asymptotically\nidentifies a near-optimal subgoal design.\n","authors":["Yijia Wang","Matthias Poloczek","Daniel R. Jiang"],"pdf_url":"https://arxiv.org/pdf/1910.09143v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05173v2","updated":"2023-10-12T17:25:44Z","published":"2023-09-11T00:02:05Z","title":"DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning","summary":" Prompt tuning (PT), where a small amount of trainable soft (continuous)\nprompt vectors is affixed to the input of language models (LM), has shown\npromising results across various tasks and models for parameter-efficient\nfine-tuning (PEFT). PT stands out from other PEFT approaches because it\nmaintains competitive performance with fewer trainable parameters and does not\ndrastically scale up its parameters as the model size expands. However, PT\nintroduces additional soft prompt tokens, leading to longer input sequences,\nwhich significantly impacts training and inference time and memory usage due to\nthe Transformer's quadratic complexity. Particularly concerning for Large\nLanguage Models (LLMs) that face heavy daily querying. To address this issue,\nwe propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt\ninto a shorter soft prompt and a pair of low-rank matrices that are then\noptimised with two different learning rates. 
This allows DePT to achieve better\nperformance while saving over 20% memory and time costs compared to vanilla PT\nand its variants, without changing trainable parameter sizes. Through extensive\nexperiments on 23 natural language processing (NLP) and vision-language (VL)\ntasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches,\nincluding the full fine-tuning baseline in some scenarios. Additionally, we\nempirically show that DEPT grows more efficient as the model size increases.\nOur further study reveals that DePT integrates seamlessly with\nparameter-efficient transfer learning in the few-shot learning setting and\nhighlights its adaptability to various model architectures and sizes.\n","authors":["Zhengxiang Shi","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2309.05173v2.pdf","comment":"Code is available at https://github.com/ZhengxiangShi/DePT"},{"id":"http://arxiv.org/abs/1908.04628v3","updated":"2023-10-12T17:19:09Z","published":"2019-08-13T13:20:50Z","title":"L2P: Learning to Place for Estimating Heavy-Tailed Distributed Outcomes","summary":" Many real-world prediction tasks have outcome variables that have\ncharacteristic heavy-tail distributions. Examples include copies of books sold,\nauction prices of art pieces, demand for commodities in warehouses, etc. By\nlearning heavy-tailed distributions, \"big and rare\" instances (e.g., the\nbest-sellers) will have accurate predictions. Most existing approaches are not\ndedicated to learning heavy-tailed distribution; thus, they heavily\nunder-predict such instances. To tackle this problem, we introduce Learning to\nPlace (L2P), which exploits the pairwise relationships between instances for\nlearning. In its training phase, L2P learns a pairwise preference classifier:\nis instance A > instance B? In its placing phase, L2P obtains a prediction by\nplacing the new instance among the known instances. Based on its placement, the\nnew instance is then assigned a value for its outcome variable. Experiments on\nreal data show that L2P outperforms competing approaches in terms of accuracy\nand ability to reproduce heavy-tailed outcome distribution. In addition, L2P\nprovides an interpretable model by placing each predicted instance in relation\nto its comparable neighbors. Interpretable models are highly desirable when\nlives and treasure are at stake.\n","authors":["Xindi Wang","Onur Varol","Tina Eliassi-Rad"],"pdf_url":"https://arxiv.org/pdf/1908.04628v3.pdf","comment":"9 pages, 6 figures, 2 tables Nature of changes from previous version:\n 1. Added complexity analysis in Section 2.2 2. Datasets change 3. Added\n LambdaMART in the baseline methods, also a brief discussion on why LambdaMart\n failed in our problem. 4. Figure updates"},{"id":"http://arxiv.org/abs/2310.05898v2","updated":"2023-10-12T17:16:37Z","published":"2023-10-09T17:41:29Z","title":"Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts","summary":" Lion (Evolved Sign Momentum), a new optimizer discovered through program\nsearch, has shown promising results in training large AI models. It performs\ncomparably or favorably to AdamW but with greater memory efficiency. As we can\nexpect from the results of a random search program, Lion incorporates elements\nfrom several existing algorithms, including signed momentum, decoupled weight\ndecay, Polak, and Nesterov momentum, but does not fit into any existing\ncategory of theoretically grounded optimizers. 
Thus, even though Lion appears\nto perform well as a general-purpose optimizer for a wide range of tasks, its\ntheoretical basis remains uncertain. This lack of theoretical clarity limits\nopportunities to further enhance and expand Lion's efficacy.\n This work aims to demystify Lion. Based on both continuous-time and\ndiscrete-time analysis, we demonstrate that Lion is a theoretically novel and\nprincipled approach for minimizing a general loss function $f(x)$ while\nenforcing a bound constraint $\\|x\\|_\\infty \\leq 1/\\lambda$. Lion achieves this\nthrough the incorporation of decoupled weight decay, where $\\lambda$ represents\nthe weight decay coefficient. Our analysis is made possible by the development\nof a new Lyapunov function for the Lion updates. It applies to a broader family\nof Lion-$\\kappa$ algorithms, where the $\\text{sign}(\\cdot)$ operator in Lion is\nreplaced by the subgradient of a convex function $\\kappa$, leading to the\nsolution of a general composite optimization problem of $\\min_x f(x) +\n\\kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion\nand pave the way for further improvements and extensions of Lion-related\nalgorithms.\n","authors":["Lizhang Chen","Bo Liu","Kaizhao Liang","Qiang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.05898v2.pdf","comment":"26 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.06225v2","updated":"2023-10-12T17:06:17Z","published":"2023-10-10T00:39:04Z","title":"GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using\n Large Language Models","summary":" Large language models (LLMs) have demonstrated remarkable capabilities in\nnatural language understanding across various domains, including healthcare and\nfinance. For some tasks, LLMs achieve similar or better performance than\ntrained human beings, therefore it is reasonable to employ human exams (e.g.,\ncertification tests) to assess the performance of LLMs. We present a\ncomprehensive evaluation of popular LLMs, such as Llama 2 and GPT, on their\nability to answer agriculture-related questions. In our evaluation, we also\nemploy RAG (Retrieval-Augmented Generation) and ER (Ensemble Refinement)\ntechniques, which combine information retrieval, generation capabilities, and\nprompting strategies to improve the LLMs' performance. To demonstrate the\ncapabilities of LLMs, we selected agriculture exams and benchmark datasets from\nthree of the largest agriculture producer countries: Brazil, India, and the\nUSA. Our analysis highlights GPT-4's ability to achieve a passing score on\nexams to earn credits for renewing agronomist certifications, answering 93% of\nthe questions correctly and outperforming earlier general-purpose models, which\nachieved 88% accuracy. On one of our experiments, GPT-4 obtained the highest\nperformance when compared to human subjects. This performance suggests that\nGPT-4 could potentially pass on major graduate education admission tests or\neven earn credits for renewing agronomy certificates. We also explore the\nmodels' capacity to address general agriculture-related questions and generate\ncrop management guidelines for Brazilian and Indian farmers, utilizing robust\ndatasets from the Brazilian Agency of Agriculture (Embrapa) and graduate\nprogram exams from India. 
The results suggest that GPT-4, ER, and RAG can\ncontribute meaningfully to agricultural education, assessment, and crop\nmanagement practice, offering valuable insights to farmers and agricultural\nprofessionals.\n","authors":["Bruno Silva","Leonardo Nunes","Roberto Estevão","Vijay Aski","Ranveer Chandra"],"pdf_url":"https://arxiv.org/pdf/2310.06225v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08501v1","updated":"2023-10-12T16:59:50Z","published":"2023-10-12T16:59:50Z","title":"Unsupervised Learning of Object-Centric Embeddings for Cell Instance\n Segmentation in Microscopy Images","summary":" Segmentation of objects in microscopy images is required for many biomedical\napplications. We introduce object-centric embeddings (OCEs), which embed image\npatches such that the spatial offsets between patches cropped from the same\nobject are preserved. Those learnt embeddings can be used to delineate\nindividual objects and thus obtain instance segmentations. Here, we show\ntheoretically that, under assumptions commonly found in microscopy images, OCEs\ncan be learnt through a self-supervised task that predicts the spatial offset\nbetween image patches. Together, this forms an unsupervised cell instance\nsegmentation method which we evaluate on nine diverse large-scale microscopy\ndatasets. Segmentations obtained with our method lead to substantially improved\nresults, compared to state-of-the-art baselines on six out of nine datasets,\nand perform on par on the remaining three datasets. If ground-truth annotations\nare available, our method serves as an excellent starting point for supervised\ntraining, reducing the required amount of ground-truth needed by one order of\nmagnitude, thus substantially increasing the practical applicability of our\nmethod. Source code is available at https://github.com/funkelab/cellulus.\n","authors":["Steffen Wolf","Manan Lalit","Henry Westmacott","Katie McDole","Jan Funke"],"pdf_url":"https://arxiv.org/pdf/2310.08501v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08497v1","updated":"2023-10-12T16:56:37Z","published":"2023-10-12T16:56:37Z","title":"Impact of time and note duration tokenizations on deep learning symbolic\n music modeling","summary":" Symbolic music is widely used in various deep learning tasks, including\ngeneration, transcription, synthesis, and Music Information Retrieval (MIR). It\nis mostly employed with discrete models like Transformers, which require music\nto be tokenized, i.e., formatted into sequences of distinct elements called\ntokens. Tokenization can be performed in different ways. As Transformer can\nstruggle at reasoning, but capture more easily explicit information, it is\nimportant to study how the way the information is represented for such model\nimpact their performances. In this work, we analyze the common tokenization\nmethods and experiment with time and note duration representations. We compare\nthe performances of these two impactful criteria on several tasks, including\ncomposer and emotion classification, music generation, and sequence\nrepresentation learning. 
We demonstrate that explicit information leads to\nbetter results depending on the task.\n","authors":["Nathan Fradet","Nicolas Gutowski","Fabien Chhel","Jean-Pierre Briot"],"pdf_url":"https://arxiv.org/pdf/2310.08497v1.pdf","comment":"ISMIR 2023"},{"id":"http://arxiv.org/abs/2310.08495v1","updated":"2023-10-12T16:55:04Z","published":"2023-10-12T16:55:04Z","title":"Characterizing climate pathways using feature importance on echo state\n networks","summary":" The 2022 National Defense Strategy of the United States listed climate change\nas a serious threat to national security. Climate intervention methods, such as\nstratospheric aerosol injection, have been proposed as mitigation strategies,\nbut the downstream effects of such actions on a complex climate system are not\nwell understood. The development of algorithmic techniques for quantifying\nrelationships between source and impact variables related to a climate event\n(i.e., a climate pathway) would help inform policy decisions. Data-driven deep\nlearning models have become powerful tools for modeling highly nonlinear\nrelationships and may provide a route to characterize climate variable\nrelationships. In this paper, we explore the use of an echo state network (ESN)\nfor characterizing climate pathways. ESNs are a computationally efficient\nneural network variation designed for temporal data, and recent work proposes\nESNs as a useful tool for forecasting spatio-temporal climate data. Like other\nneural networks, ESNs are non-interpretable black-box models, which poses a\nhurdle for understanding variable relationships. We address this issue by\ndeveloping feature importance methods for ESNs in the context of\nspatio-temporal data to quantify variable relationships captured by the model.\nWe conduct a simulation study to assess and compare the feature importance\ntechniques, and we demonstrate the approach on reanalysis climate data. In the\nclimate application, we select a time period that includes the 1991 volcanic\neruption of Mount Pinatubo. This event was a significant stratospheric aerosol\ninjection, which we use as a proxy for an artificial stratospheric aerosol\ninjection. Using the proposed approach, we are able to characterize\nrelationships between pathway variables associated with this event.\n","authors":["Katherine Goode","Daniel Ries","Kellie McClernon"],"pdf_url":"https://arxiv.org/pdf/2310.08495v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08491v1","updated":"2023-10-12T16:50:08Z","published":"2023-10-12T16:50:08Z","title":"Prometheus: Inducing Fine-grained Evaluation Capability in Language\n Models","summary":" Recently, using a powerful proprietary Large Language Model (LLM) (e.g.,\nGPT-4) as an evaluator for long-form responses has become the de facto\nstandard. However, for practitioners with large-scale evaluation tasks and\ncustom criteria in consideration (e.g., child-readability), using proprietary\nLLMs as an evaluator is unreliable due to the closed-source nature,\nuncontrolled versioning, and prohibitive costs. In this work, we propose\nPrometheus, a fully open-source LLM that is on par with GPT-4's evaluation\ncapabilities when the appropriate reference materials (reference answer, score\nrubric) are accompanied. We first construct the Feedback Collection, a new\ndataset that consists of 1K fine-grained score rubrics, 20K instructions, and\n100K responses and language feedback generated by GPT-4. 
Using the Feedback\nCollection, we train Prometheus, a 13B evaluator LLM that can assess any given\nlong-form text based on customized score rubric provided by the user.\nExperimental results show that Prometheus scores a Pearson correlation of 0.897\nwith human evaluators when evaluating with 45 customized score rubrics, which\nis on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392).\nFurthermore, measuring correlation with GPT-4 with 1222 customized score\nrubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask\nEval) shows similar trends, bolstering Prometheus's capability as an evaluator\nLLM. Lastly, Prometheus achieves the highest accuracy on two human preference\nbenchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced\nreward models explicitly trained on human preference datasets, highlighting its\npotential as an universal reward model. We open-source our code, dataset, and\nmodel at https://github.com/kaistAI/Prometheus.\n","authors":["Seungone Kim","Jamin Shin","Yejin Cho","Joel Jang","Shayne Longpre","Hwaran Lee","Sangdoo Yun","Seongjin Shin","Sungdong Kim","James Thorne","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2310.08491v1.pdf","comment":"Work in Progress"},{"id":"http://arxiv.org/abs/2310.06823v2","updated":"2023-10-12T16:42:55Z","published":"2023-10-10T17:53:36Z","title":"NECO: NEural Collapse Based Out-of-distribution detection","summary":" Detecting out-of-distribution (OOD) data is a critical challenge in machine\nlearning due to model overconfidence, often without awareness of their\nepistemological limits. We hypothesize that ``neural collapse'', a phenomenon\naffecting in-distribution data for models trained beyond loss convergence, also\ninfluences OOD data. To benefit from this interplay, we introduce NECO, a novel\npost-hoc method for OOD detection, which leverages the geometric properties of\n``neural collapse'' and of principal component spaces to identify OOD data. Our\nextensive experiments demonstrate that NECO achieves state-of-the-art results\non both small and large-scale OOD detection tasks while exhibiting strong\ngeneralization capabilities across different network architectures.\nFurthermore, we provide a theoretical explanation for the effectiveness of our\nmethod in OOD detection. We plan to release the code after the anonymity\nperiod.\n","authors":["Mouïn Ben Ammar","Nacim Belkhir","Sebastian Popescu","Antoine Manzanera","Gianni Franchi"],"pdf_url":"https://arxiv.org/pdf/2310.06823v2.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/2310.08475v1","updated":"2023-10-12T16:32:44Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. 
Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights\\footnote{Code\nand dataset are available in https://github.com/zjunlp/EasyEdit.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08470v1","updated":"2023-10-12T16:28:25Z","published":"2023-10-12T16:28:25Z","title":"Strategies and impact of learning curve estimation for CNN-based image\n classification","summary":" Learning curves are a measure for how the performance of machine learning\nmodels improves given a certain volume of training data. Over a wide variety of\napplications and models it was observed that learning curves follow -- to a\nlarge extent -- a power law behavior. This makes the performance of different\nmodels for a given task somewhat predictable and opens the opportunity to\nreduce the training time for practitioners, who are exploring the space of\npossible models and hyperparameters for the problem at hand. By estimating the\nlearning curve of a model from training on small subsets of data only the best\nmodels need to be considered for training on the full dataset. How to choose\nsubset sizes and how often to sample models on these to obtain estimates is\nhowever not researched. Given that the goal is to reduce overall training time\nstrategies are needed that sample the performance in a time-efficient way and\nyet leads to accurate learning curve estimates. In this paper we formulate the\nframework for these strategies and propose several strategies. Further we\nevaluate the strategies for simulated learning curves and in experiments with\npopular datasets and models for image classification tasks.\n","authors":["Laura Didyk","Brayden Yarish","Michael A. Beck","Christopher P. Bidinosti","Christopher J. Henry"],"pdf_url":"https://arxiv.org/pdf/2310.08470v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08461v1","updated":"2023-10-12T16:21:04Z","published":"2023-10-12T16:21:04Z","title":"DistillSpec: Improving Speculative Decoding via Knowledge Distillation","summary":" Speculative decoding (SD) accelerates large language model inference by\nemploying a faster draft model for generating multiple tokens, which are then\nverified in parallel by the larger target model, resulting in the text\ngenerated according to the target model distribution. However, identifying a\ncompact draft model that is well-aligned with the target model is challenging.\nTo tackle this issue, we propose DistillSpec that uses knowledge distillation\nto better align the draft model with the target model, before applying SD.\nDistillSpec makes two key design choices, which we demonstrate via systematic\nstudy to be crucial to improving the draft and target alignment: utilizing\non-policy data generation from the draft model, and tailoring the divergence\nfunction to the task and decoding strategy. Notably, DistillSpec yields\nimpressive 10 - 45% speedups over standard SD on a range of standard\nbenchmarks, using both greedy and non-greedy sampling. Furthermore, we combine\nDistillSpec with lossy SD to achieve fine-grained control over the latency vs.\ntask performance trade-off. 
Finally, in practical scenarios with models of\nvarying sizes, first using distillation to boost the performance of the target\nmodel and then applying DistillSpec to train a well-aligned draft model can\nreduce decoding latency by 6-10x with minimal performance drop, compared to\nstandard decoding without distillation.\n","authors":["Yongchao Zhou","Kaifeng Lyu","Ankit Singh Rawat","Aditya Krishna Menon","Afshin Rostamizadeh","Sanjiv Kumar","Jean-François Kagy","Rishabh Agarwal"],"pdf_url":"https://arxiv.org/pdf/2310.08461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08459v1","updated":"2023-10-12T16:19:58Z","published":"2023-10-12T16:19:58Z","title":"A Survey on Heterogeneous Transfer Learning","summary":" The application of transfer learning, an approach utilizing knowledge from a\nsource domain to enhance model performance in a target domain, has seen a\ntremendous rise in recent years, underpinning many real-world scenarios. The\nkey to its success lies in the shared common knowledge between the domains, a\nprerequisite in most transfer learning methodologies. These methods typically\npresuppose identical feature spaces and label spaces in both domains, known as\nhomogeneous transfer learning, which, however, is not always a practical\nassumption. Oftentimes, the source and target domains vary in feature spaces,\ndata distributions, and label spaces, making it challenging or costly to secure\nsource domain data with identical feature and label spaces as the target\ndomain. Arbitrary elimination of these differences is not always feasible or\noptimal. Thus, heterogeneous transfer learning, acknowledging and dealing with\nsuch disparities, has emerged as a promising approach for a variety of tasks.\nDespite the existence of a survey in 2017 on this topic, the fast-paced\nadvances post-2017 necessitate an updated, in-depth review. We therefore\npresent a comprehensive survey of recent developments in heterogeneous transfer\nlearning methods, offering a systematic guide for future research. Our paper\nreviews methodologies for diverse learning scenarios, discusses the limitations\nof current studies, and covers various application contexts, including Natural\nLanguage Processing, Computer Vision, Multimodality, and Biomedicine, to foster\na deeper understanding and spur future research.\n","authors":["Runxue Bao","Yiming Sun","Yuhe Gao","Jindong Wang","Qiang Yang","Haifeng Chen","Zhi-Hong Mao","Xing Xie","Ye Ye"],"pdf_url":"https://arxiv.org/pdf/2310.08459v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.15363v3","updated":"2023-10-12T16:12:57Z","published":"2022-11-28T14:38:45Z","title":"On the Security Vulnerabilities of Text-to-SQL Models","summary":" Although it has been demonstrated that Natural Language Processing (NLP)\nalgorithms are vulnerable to deliberate attacks, the question of whether such\nweaknesses can lead to software security threats is under-explored. To bridge\nthis gap, we conducted vulnerability tests on Text-to-SQL systems that are\ncommonly used to create natural language interfaces to databases. We showed\nthat the Text-to-SQL modules within six commercial applications can be\nmanipulated to produce malicious code, potentially leading to data breaches and\nDenial of Service attacks. This is the first demonstration that NLP models can\nbe exploited as attack vectors in the wild. 
In addition, experiments using four\nopen-source language models verified that straightforward backdoor attacks on\nText-to-SQL systems achieve a 100% success rate without affecting their\nperformance. The aim of this work is to draw the community's attention to\npotential software security issues associated with NLP algorithms and encourage\nexploration of methods to mitigate against them.\n","authors":["Xutan Peng","Yipeng Zhang","Jingfeng Yang","Mark Stevenson"],"pdf_url":"https://arxiv.org/pdf/2211.15363v3.pdf","comment":"ISSRE 2023: Best Paper Candidate"},{"id":"http://arxiv.org/abs/2305.14259v3","updated":"2023-10-12T16:10:51Z","published":"2023-05-23T17:12:08Z","title":"Learning to Generate Novel Scientific Directions with Contextualized\n Literature-based Discovery","summary":" Literature-Based Discovery (LBD) aims to discover new scientific knowledge by\nmining papers and generating hypotheses. Standard LBD is limited to predicting\npairwise relations between discrete concepts (e.g., drug-disease links), and\nignores critical contexts like experimental settings (e.g., a specific patient\npopulation where a drug is evaluated) and background motivations (e.g., to find\ndrugs without specific side effects). We address these limitations with a novel\nformulation of contextualized-LBD (C-LBD): generating scientific hypotheses in\nnatural language, while grounding them in a context that controls the\nhypothesis search space. We present a modeling framework using retrieval of\n``inspirations'' from past scientific papers. Our evaluations reveal that GPT-4\ntends to generate ideas with overall low technical depth and novelty, while our\ninspiration prompting approaches partially mitigate this issue. Our work\nrepresents a first step toward building language models that generate new ideas\nderived from scientific literature.\n","authors":["Qingyun Wang","Doug Downey","Heng Ji","Tom Hope"],"pdf_url":"https://arxiv.org/pdf/2305.14259v3.pdf","comment":"24 pages. Code and resource is available at\n https://github.com/EagleW/CLBD"},{"id":"http://arxiv.org/abs/2310.08446v1","updated":"2023-10-12T16:06:18Z","published":"2023-10-12T16:06:18Z","title":"Towards Robust Multi-Modal Reasoning via Model Selection","summary":" The reasoning capabilities of LLM (Large Language Model) are widely\nacknowledged in recent research, inspiring studies on tool learning and\nautonomous agents. LLM serves as the \"brain\" of agent, orchestrating multiple\ntools for collaborative multi-step task solving. Unlike methods invoking tools\nlike calculators or weather APIs for straightforward tasks, multi-modal agents\nexcel by integrating diverse AI models for complex challenges. However, current\nmulti-modal agents neglect the significance of model selection: they primarily\nfocus on the planning and execution phases, and will only invoke predefined\ntask-specific models for each subtask, making the execution fragile. Meanwhile,\nother traditional model selection methods are either incompatible with or\nsuboptimal for the multi-modal agent scenarios, due to ignorance of\ndependencies among subtasks arising by multi-step reasoning.\n To this end, we identify the key challenges therein and propose the\n$\\textit{M}^3$ framework as a plug-in with negligible runtime overhead at\ntest-time. This framework improves model selection and bolsters the robustness\nof multi-modal agents in multi-step reasoning. 
In the absence of suitable\nbenchmarks, we create MS-GQA, a new dataset specifically designed to\ninvestigate the model selection challenge in multi-modal agents. Our\nexperiments reveal that our framework enables dynamic model selection,\nconsidering both user inputs and subtask dependencies, thereby robustifying the\noverall reasoning process. Our code and benchmark:\nhttps://github.com/LINs-lab/M3.\n","authors":["Xiangyan Liu","Rongxue Li","Wei Ji","Tao Lin"],"pdf_url":"https://arxiv.org/pdf/2310.08446v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2308.16198v2","updated":"2023-10-12T15:57:52Z","published":"2023-08-25T21:30:16Z","title":"Learning Collaborative Information Dissemination with Graph-based\n Multi-Agent Reinforcement Learning","summary":" In modern communication systems, efficient and reliable information\ndissemination is crucial for supporting critical operations across domains like\ndisaster response, autonomous vehicles, and sensor networks. This paper\nintroduces a Multi-Agent Reinforcement Learning (MARL) approach as a\nsignificant step forward in achieving more decentralized, efficient, and\ncollaborative solutions. We propose a Partially Observable Stochastic Game\n(POSG) formulation for information dissemination empowering each agent to\ndecide on message forwarding independently, based on their one-hop\nneighborhood. This constitutes a significant paradigm shift from traditional\nheuristics based on Multi-Point Relay (MPR) selection. Our approach harnesses\nGraph Convolutional Reinforcement Learning, employing Graph Attention Networks\n(GAT) with dynamic attention to capture essential network features. We propose\ntwo approaches, L-DGN and HL-DGN, which differ in the information that is\nexchanged among agents. We evaluate the performance of our decentralized\napproaches, by comparing them with a widely-used MPR heuristic, and we show\nthat our trained policies are able to efficiently cover the network while\nbypassing the MPR set selection process. Our approach is a first step toward\nsupporting the resilience of real-world broadcast communication infrastructures\nvia learned, collaborative information dissemination.\n","authors":["Raffaele Galliera","Kristen Brent Venable","Matteo Bassani","Niranjan Suri"],"pdf_url":"https://arxiv.org/pdf/2308.16198v2.pdf","comment":"11 pages (2 of Supplementary Materials), 4 figures, 3 tables"},{"id":"http://arxiv.org/abs/2310.08431v1","updated":"2023-10-12T15:56:02Z","published":"2023-10-12T15:56:02Z","title":"Neural Sampling in Hierarchical Exponential-family Energy-based Models","summary":" Bayesian brain theory suggests that the brain employs generative models to\nunderstand the external world. The sampling-based perspective posits that the\nbrain infers the posterior distribution through samples of stochastic neuronal\nresponses. Additionally, the brain continually updates its generative model to\napproach the true distribution of the external world. In this study, we\nintroduce the Hierarchical Exponential-family Energy-based (HEE) model, which\ncaptures the dynamics of inference and learning. In the HEE model, we decompose\nthe partition function into individual layers and leverage a group of neurons\nwith shorter time constants to sample the gradient of the decomposed\nnormalization term. This allows our model to estimate the partition function\nand perform inference simultaneously, circumventing the negative phase\nencountered in conventional energy-based models (EBMs). 
As a result, the\nlearning process is localized both in time and space, and the model is easy to\nconverge. To match the brain's rapid computation, we demonstrate that neural\nadaptation can serve as a momentum term, significantly accelerating the\ninference process. On natural image datasets, our model exhibits\nrepresentations akin to those observed in the biological visual system.\nFurthermore, for the machine learning community, our model can generate\nobservations through joint or marginal generation. We show that marginal\ngeneration outperforms joint generation and achieves performance on par with\nother EBMs.\n","authors":["Xingsi Dong","Si Wu"],"pdf_url":"https://arxiv.org/pdf/2310.08431v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2304.09663v2","updated":"2023-10-12T15:52:37Z","published":"2023-04-19T13:50:13Z","title":"Generative modeling of time-dependent densities via optimal transport\n and projection pursuit","summary":" Motivated by the computational difficulties incurred by popular deep learning\nalgorithms for the generative modeling of temporal densities, we propose a\ncheap alternative which requires minimal hyperparameter tuning and scales\nfavorably to high dimensional problems. In particular, we use a\nprojection-based optimal transport solver [Meng et al., 2019] to join\nsuccessive samples and subsequently use transport splines [Chewi et al., 2020]\nto interpolate the evolving density. When the sampling frequency is\nsufficiently high, the optimal maps are close to the identity and are thus\ncomputationally efficient to compute. Moreover, the training process is highly\nparallelizable as all optimal maps are independent and can thus be learned\nsimultaneously. Finally, the approach is based solely on numerical linear\nalgebra rather than minimizing a nonconvex objective function, allowing us to\neasily analyze and control the algorithm. We present several numerical\nexperiments on both synthetic and real-world datasets to demonstrate the\nefficiency of our method. In particular, these experiments show that the\nproposed approach is highly competitive compared with state-of-the-art\nnormalizing flows conditioned on time across a wide range of dimensionalities.\n","authors":["Jonah Botvinick-Greenhouse","Yunan Yang","Romit Maulik"],"pdf_url":"https://arxiv.org/pdf/2304.09663v2.pdf","comment":"This article may be downloaded for personal use only. Any other use\n requires prior permission of the author and AIP Publishing. This article\n appeared in Chaos: An Interdisciplinary Journal of Nonlinear Science, Volume\n 33, Issue 10, October 2023 and may be found at\n https://doi.org/10.1063/5.0155783"},{"id":"http://arxiv.org/abs/2310.08425v1","updated":"2023-10-12T15:48:14Z","published":"2023-10-12T15:48:14Z","title":"Differentially Private Non-convex Learning for Multi-layer Neural\n Networks","summary":" This paper focuses on the problem of Differentially Private Stochastic\nOptimization for (multi-layer) fully connected neural networks with a single\noutput node. In the first part, we examine cases with no hidden nodes,\nspecifically focusing on Generalized Linear Models (GLMs). We investigate the\nwell-specific model where the random noise possesses a zero mean, and the link\nfunction is both bounded and Lipschitz continuous. We propose several\nalgorithms and our analysis demonstrates the feasibility of achieving an excess\npopulation risk that remains invariant to the data dimension. 
We also delve\ninto the scenario involving the ReLU link function, and our findings mirror\nthose of the bounded link function. We conclude this section by contrasting\nwell-specified and misspecified models, using ReLU regression as a\nrepresentative example.\n In the second part of the paper, we extend our ideas to two-layer neural\nnetworks with sigmoid or ReLU activation functions in the well-specified model.\nIn the third part, we study the theoretical guarantees of DP-SGD in Abadi et\nal. (2016) for fully connected multi-layer neural networks. By utilizing recent\nadvances in Neural Tangent Kernel theory, we provide the first excess\npopulation risk when both the sample size and the width of the network are\nsufficiently large. Additionally, we discuss the role of some parameters in\nDP-SGD regarding their utility, both theoretically and empirically.\n","authors":["Hanpu Shen","Cheng-Long Wang","Zihang Xiang","Yiming Ying","Di Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08425v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.12345v4","updated":"2023-10-12T15:44:42Z","published":"2022-11-22T15:34:59Z","title":"Understanding Sparse Feature Updates in Deep Networks using Iterative\n Linearisation","summary":" Larger and deeper networks generalise well despite their increased capacity\nto overfit. Understanding why this happens is theoretically and practically\nimportant. One recent approach looks at the infinitely wide limits of such\nnetworks and their corresponding kernels. However, these theoretical tools\ncannot fully explain finite networks as the empirical kernel changes\nsignificantly during gradient-descent-based training in contrast to infinite\nnetworks. In this work, we derive an iterative linearised training method as a\nnovel empirical tool to further investigate this distinction, allowing us to\ncontrol for sparse (i.e. infrequent) feature updates and quantify the frequency\nof feature learning needed to achieve comparable performance. We justify\niterative linearisation as an interpolation between a finite analog of the\ninfinite width regime, which does not learn features, and standard gradient\ndescent training, which does. Informally, we also show that it is analogous to\na damped version of the Gauss-Newton algorithm -- a second-order method. We\nshow that in a variety of cases, iterative linearised training surprisingly\nperforms on par with standard training, noting in particular how much less\nfrequent feature learning is required to achieve comparable performance. We\nalso show that feature learning is essential for good performance. Since such\nfeature learning inevitably causes changes in the NTK kernel, we provide direct\nnegative evidence for the NTK theory, which states the NTK kernel remains\nconstant during training.\n","authors":["Adrian Goldwaser","Hong Ge"],"pdf_url":"https://arxiv.org/pdf/2211.12345v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08419v1","updated":"2023-10-12T15:38:28Z","published":"2023-10-12T15:38:28Z","title":"Jailbreaking Black Box Large Language Models in Twenty Queries","summary":" There is growing interest in ensuring that large language models (LLMs) align\nwith human values. However, the alignment of such models is vulnerable to\nadversarial jailbreaks, which coax LLMs into overriding their safety\nguardrails. 
The identification of these vulnerabilities is therefore\ninstrumental in understanding inherent weaknesses and preventing future misuse.\nTo this end, we propose Prompt Automatic Iterative Refinement (PAIR), an\nalgorithm that generates semantic jailbreaks with only black-box access to an\nLLM. PAIR -- which is inspired by social engineering attacks -- uses an\nattacker LLM to automatically generate jailbreaks for a separate targeted LLM\nwithout human intervention. In this way, the attacker LLM iteratively queries\nthe target LLM to update and refine a candidate jailbreak. Empirically, PAIR\noften requires fewer than twenty queries to produce a jailbreak, which is\norders of magnitude more efficient than existing algorithms. PAIR also achieves\ncompetitive jailbreaking success rates and transferability on open and\nclosed-source LLMs, including GPT-3.5/4, Vicuna, and PaLM-2.\n","authors":["Patrick Chao","Alexander Robey","Edgar Dobriban","Hamed Hassani","George J. Pappas","Eric Wong"],"pdf_url":"https://arxiv.org/pdf/2310.08419v1.pdf","comment":"21 pages, 10 figures"},{"id":"http://arxiv.org/abs/2303.10108v2","updated":"2023-10-12T15:24:28Z","published":"2023-03-17T16:39:21Z","title":"Data-Centric Learning from Unlabeled Graphs with Diffusion Model","summary":" Graph property prediction tasks are important and numerous. While each task\noffers a small size of labeled examples, unlabeled graphs have been collected\nfrom various sources and at a large scale. A conventional approach is training\na model with the unlabeled graphs on self-supervised tasks and then fine-tuning\nthe model on the prediction tasks. However, the self-supervised task knowledge\ncould not be aligned or sometimes conflicted with what the predictions needed.\nIn this paper, we propose to extract the knowledge underlying the large set of\nunlabeled graphs as a specific set of useful data points to augment each\nproperty prediction model. We use a diffusion model to fully utilize the\nunlabeled graphs and design two new objectives to guide the model's denoising\nprocess with each task's labeled data to generate task-specific graph examples\nand their labels. Experiments demonstrate that our data-centric approach\nperforms significantly better than fifteen existing various methods on fifteen\ntasks. The performance improvement brought by unlabeled data is visible as the\ngenerated labeled examples unlike the self-supervised learning.\n","authors":["Gang Liu","Eric Inae","Tong Zhao","Jiaxin Xu","Tengfei Luo","Meng Jiang"],"pdf_url":"https://arxiv.org/pdf/2303.10108v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08394v1","updated":"2023-10-12T15:07:11Z","published":"2023-10-12T15:07:11Z","title":"Towards Better Evaluation of Instruction-Following: A Case-Study in\n Summarization","summary":" Despite recent advances, evaluating how well large language models (LLMs)\nfollow user instructions remains an open problem. While evaluation methods of\nlanguage models have seen a rise in prompt-based approaches, limited work on\nthe correctness of these methods has been conducted. In this work, we perform a\nmeta-evaluation of a variety of metrics to quantify how accurately they measure\nthe instruction-following abilities of LLMs. Our investigation is performed on\ngrounded query-based summarization by collecting a new short-form, real-world\ndataset riSum, containing $300$ document-instruction pairs with $3$ answers\neach. All $900$ answers are rated by $3$ human annotators. 
Using riSum, we\nanalyze agreement between evaluation methods and human judgment. Finally, we\npropose new LLM-based reference-free evaluation methods that improve upon\nestablished baselines and perform on-par with costly reference-based metrics\nwhich require high-quality summaries.\n","authors":["Ondrej Skopek","Rahul Aralikatte","Sian Gooding","Victor Carbune"],"pdf_url":"https://arxiv.org/pdf/2310.08394v1.pdf","comment":"Accepted to CoNLL 2023"},{"id":"http://arxiv.org/abs/2308.12243v2","updated":"2023-10-12T15:06:50Z","published":"2023-08-23T16:42:27Z","title":"Multi-Objective Optimization for Sparse Deep Neural Network Training","summary":" Different conflicting optimization criteria arise naturally in various Deep\nLearning scenarios. These can address different main tasks (i.e., in the\nsetting of Multi-Task Learning), but also main and secondary tasks such as loss\nminimization versus sparsity. The usual approach is a simple weighting of the\ncriteria, which formally only works in the convex setting. In this paper, we\npresent a Multi-Objective Optimization algorithm using a modified Weighted\nChebyshev scalarization for training Deep Neural Networks (DNNs) with respect\nto several tasks. By employing this scalarization technique, the algorithm can\nidentify all optimal solutions of the original problem while reducing its\ncomplexity to a sequence of single-objective problems. The simplified problems\nare then solved using an Augmented Lagrangian method, enabling the use of\npopular optimization techniques such as Adam and Stochastic Gradient Descent,\nwhile efficaciously handling constraints. Our work aims to address the\n(economical and also ecological) sustainability issue of DNN models, with a\nparticular focus on Deep Multi-Task models, which are typically designed with a\nvery large number of weights to perform equally well on multiple tasks. Through\nexperiments conducted on two Machine Learning datasets, we demonstrate the\npossibility of adaptively sparsifying the model during training without\nsignificantly impacting its performance, if we are willing to apply\ntask-specific adaptations to the network weights. Code is available at\nhttps://github.com/salomonhotegni/MDMTN.\n","authors":["S. S. Hotegni","S. Peitz","M. Berkemeier"],"pdf_url":"https://arxiv.org/pdf/2308.12243v2.pdf","comment":"13 pages, 7 figures"},{"id":"http://arxiv.org/abs/2304.02621v2","updated":"2023-10-12T15:05:15Z","published":"2023-04-05T17:43:57Z","title":"High-fidelity Pseudo-labels for Boosting Weakly-Supervised Segmentation","summary":" Image-level weakly-supervised semantic segmentation (WSSS) reduces the\nusually vast data annotation cost by surrogate segmentation masks during\ntraining. The typical approach involves training an image classification\nnetwork using global average pooling (GAP) on convolutional feature maps. This\nenables the estimation of object locations based on class activation maps\n(CAMs), which identify the importance of image regions. The CAMs are then used\nto generate pseudo-labels, in the form of segmentation masks, to supervise a\nsegmentation model in the absence of pixel-level ground truth. Our work is\nbased on two techniques for improving CAMs; importance sampling, which is a\nsubstitute for GAP, and the feature similarity loss, which utilizes a heuristic\nthat object contours almost always align with color edges in images. 
However,\nboth are based on the multinomial posterior with softmax, and implicitly assume\nthat classes are mutually exclusive, which turns out to be suboptimal in our\nexperiments. Thus, we reformulate both techniques based on binomial posteriors\nof multiple independent binary problems. This has two benefits: their\nperformance is improved and they become more general, resulting in an add-on\nmethod that can boost virtually any WSSS method. This is demonstrated on a wide\nvariety of baselines on the PASCAL VOC dataset, improving the region similarity\nand contour quality of all implemented state-of-the-art methods. Experiments on\nthe MS COCO dataset show that our proposed add-on is well-suited for\nlarge-scale settings. Our code is available at https://github.com/arvijj/hfpl.\n","authors":["Arvi Jonnarth","Yushan Zhang","Michael Felsberg"],"pdf_url":"https://arxiv.org/pdf/2304.02621v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08392v1","updated":"2023-10-12T15:03:50Z","published":"2023-10-12T15:03:50Z","title":"Introducing a Deep Neural Network-based Model Predictive Control\n Framework for Rapid Controller Implementation","summary":" Model Predictive Control (MPC) provides an optimal control solution based on\na cost function while allowing for the implementation of process constraints.\nAs a model-based optimal control technique, the performance of MPC strongly\ndepends on the model used, where a trade-off between model computation time and\nprediction performance exists. One solution is the integration of MPC with a\nmachine learning (ML) based process model which is quick to evaluate online.\nThis work presents the experimental implementation of a deep neural network\n(DNN) based nonlinear MPC for Homogeneous Charge Compression Ignition (HCCI)\ncombustion control. The DNN model consists of a Long Short-Term Memory (LSTM)\nnetwork surrounded by fully connected layers which was trained using\nexperimental engine data and showed acceptable prediction performance with\nunder 5% error for all outputs. Using this model, the MPC is designed to track\nthe Indicated Mean Effective Pressure (IMEP) and combustion phasing\ntrajectories, while minimizing several parameters. Using the acados software\npackage to enable the real-time implementation of the MPC on an ARM Cortex A72,\nthe optimization calculations are completed within 1.4 ms. The external A72\nprocessor is integrated with the prototyping engine controller using a UDP\nconnection allowing for rapid experimental deployment of the NMPC. The IMEP\ntrajectory following of the developed controller was excellent, with a\nroot-mean-square error of 0.133 bar, in addition to observing process\nconstraints.\n","authors":["David C. Gordon","Alexander Winkler","Julian Bedei","Patrick Schaber","Jakob Andert","Charles R. Koch"],"pdf_url":"https://arxiv.org/pdf/2310.08392v1.pdf","comment":"Submitted to 2024 American Control Conference (ACC), July 8-12, 2024\n in Toronto, Canada. ACC is the annual conference of the American Automatic\n Control Council (AACC), the U.S. 
national member organization of the\n International Federation for Automatic Control (IFAC)"},{"id":"http://arxiv.org/abs/2310.08391v1","updated":"2023-10-12T15:01:43Z","published":"2023-10-12T15:01:43Z","title":"How Many Pretraining Tasks Are Needed for In-Context Learning of Linear\n Regression?","summary":" Transformers pretrained on diverse tasks exhibit remarkable in-context\nlearning (ICL) capabilities, enabling them to solve unseen tasks solely based\non input contexts without adjusting model parameters. In this paper, we study\nICL in one of its simplest setups: pretraining a linearly parameterized\nsingle-layer linear attention model for linear regression with a Gaussian\nprior. We establish a statistical task complexity bound for the attention model\npretraining, showing that effective pretraining only requires a small number of\nindependent tasks. Furthermore, we prove that the pretrained model closely\nmatches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by\nachieving nearly Bayes optimal risk on unseen tasks under a fixed context\nlength. These theoretical findings complement prior experimental research and\nshed light on the statistical foundations of ICL.\n","authors":["Jingfeng Wu","Difan Zou","Zixiang Chen","Vladimir Braverman","Quanquan Gu","Peter L. Bartlett"],"pdf_url":"https://arxiv.org/pdf/2310.08391v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15395v2","updated":"2023-10-12T15:00:55Z","published":"2023-09-27T04:33:09Z","title":"Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs","summary":" This paper considers the best policy identification (BPI) problem in online\nConstrained Markov Decision Processes (CMDPs). We are interested in algorithms\nthat are model-free, have low regret, and identify an optimal policy with a\nhigh probability. Existing model-free algorithms for online CMDPs with\nsublinear regret and constraint violation do not provide any convergence\nguarantee to an optimal policy and provide only average performance guarantees\nwhen a policy is uniformly sampled at random from all previously used policies.\nIn this paper, we develop a new algorithm, named\nPruning-Refinement-Identification (PRI), based on a fundamental structural\nproperty of CMDPs we discover, called limited stochasticity. The property says\nfor a CMDP with $N$ constraints, there exists an optimal policy with at most\n$N$ stochastic decisions.\n The proposed algorithm first identifies at which step and in which state a\nstochastic decision has to be taken and then fine-tunes the distributions of\nthese stochastic decisions. 
PRI achieves trio objectives: (i) PRI is a\nmodel-free algorithm; and (ii) it outputs a near-optimal policy with a high\nprobability at the end of learning; and (iii) in the tabular setting, PRI\nguarantees $\\tilde{\\mathcal{O}}(\\sqrt{K})$ regret and constraint violation,\nwhich significantly improves the best existing regret bound\n$\\tilde{\\mathcal{O}}(K^{\\frac{4}{5}})$ under a model-free algorithm, where $K$\nis the total number of episodes.\n","authors":["Zihan Zhou","Honghao Wei","Lei Ying"],"pdf_url":"https://arxiv.org/pdf/2309.15395v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08387v1","updated":"2023-10-12T14:59:22Z","published":"2023-10-12T14:59:22Z","title":"MeanAP-Guided Reinforced Active Learning for Object Detection","summary":" Active learning presents a promising avenue for training high-performance\nmodels with minimal labeled data, achieved by judiciously selecting the most\ninformative instances to label and incorporating them into the task learner.\nDespite notable advancements in active learning for image recognition, metrics\ndevised or learned to gauge the information gain of data, crucial for query\nstrategy design, do not consistently align with task model performance metrics,\nsuch as Mean Average Precision (MeanAP) in object detection tasks. This paper\nintroduces MeanAP-Guided Reinforced Active Learning for Object Detection\n(MAGRAL), a novel approach that directly utilizes the MeanAP metric of the task\nmodel to devise a sampling strategy employing a reinforcement learning-based\nsampling agent. Built upon LSTM architecture, the agent efficiently explores\nand selects subsequent training instances, and optimizes the process through\npolicy gradient with MeanAP serving as reward. Recognizing the time-intensive\nnature of MeanAP computation at each step, we propose fast look-up tables to\nexpedite agent training. We assess MAGRAL's efficacy across popular benchmarks,\nPASCAL VOC and MS COCO, utilizing different backbone architectures. Empirical\nfindings substantiate MAGRAL's superiority over recent state-of-the-art\nmethods, showcasing substantial performance gains. MAGRAL establishes a robust\nbaseline for reinforced active object detection, signifying its potential in\nadvancing the field.\n","authors":["Zhixuan Liang","Xingyu Zeng","Rui Zhao","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2310.08387v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08381v1","updated":"2023-10-12T14:55:31Z","published":"2023-10-12T14:55:31Z","title":"AutoVP: An Automated Visual Prompting Framework and Benchmark","summary":" Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach\nto adapting pre-trained vision models to solve various downstream\nimage-classification tasks. However, there has hitherto been little systematic\nstudy of the design space of VP and no clear benchmark for evaluating its\nperformance. To bridge this gap, we propose AutoVP, an end-to-end expandable\nframework for automating VP design choices, along with 12 downstream\nimage-classification tasks that can serve as a holistic VP-performance\nbenchmark. Our design space covers 1) the joint optimization of the prompts; 2)\nthe selection of pre-trained models, including image classifiers and text-image\nencoders; and 3) model output mapping strategies, including nonparametric and\ntrainable label mapping. 
Our extensive experimental results show that AutoVP\noutperforms the best-known current VP methods by a substantial margin, having\nup to 6.7% improvement in accuracy; and attains a maximum performance increase\nof 27.5% compared to linear-probing (LP) baseline. AutoVP thus makes a two-fold\ncontribution: serving both as an efficient tool for hyperparameter tuning on VP\ndesign choices, and as a comprehensive benchmark that can reasonably be\nexpected to accelerate VP's development. The source code is available at\nhttps://github.com/IBM/AutoVP.\n","authors":["Hsi-Ai Tsao","Lei Hsiung","Pin-Yu Chen","Sijia Liu","Tsung-Yi Ho"],"pdf_url":"https://arxiv.org/pdf/2310.08381v1.pdf","comment":"Preprint. The code is available at https://github.com/IBM/AutoVP"},{"id":"http://arxiv.org/abs/2306.16335v2","updated":"2023-10-12T14:46:15Z","published":"2023-06-28T16:11:50Z","title":"Emulating the dynamics of complex systems using autoregressive models on\n manifolds (mNARX)","summary":" We propose a novel surrogate modelling approach to efficiently and accurately\napproximate the response of complex dynamical systems driven by time-varying\nexogenous excitations over extended time periods. Our approach, namely manifold\nnonlinear autoregressive modelling with exogenous input (mNARX), involves\nconstructing a problem-specific exogenous input manifold that is optimal for\nconstructing autoregressive surrogates. The manifold, which forms the core of\nmNARX, is constructed incrementally by incorporating the physics of the system,\nas well as prior expert- and domain- knowledge. Because mNARX decomposes the\nfull problem into a series of smaller sub-problems, each with a lower\ncomplexity than the original, it scales well with the complexity of the\nproblem, both in terms of training and evaluation costs of the final surrogate.\nFurthermore, mNARX synergizes well with traditional dimensionality reduction\ntechniques, making it highly suitable for modelling dynamical systems with\nhigh-dimensional exogenous inputs, a class of problems that is typically\nchallenging to solve. Since domain knowledge is particularly abundant in\nphysical systems, such as those found in civil and mechanical engineering,\nmNARX is well suited for these applications. We demonstrate that mNARX\noutperforms traditional autoregressive surrogates in predicting the response of\na classical coupled spring-mass system excited by a one-dimensional random\nexcitation. Additionally, we show that mNARX is well suited for emulating very\nhigh-dimensional time- and state-dependent systems, even when affected by\nactive controllers, by surrogating the dynamics of a realistic\naero-servo-elastic onshore wind turbine simulator. In general, our results\ndemonstrate that mNARX offers promising prospects for modelling complex\ndynamical systems, in terms of accuracy and efficiency.\n","authors":["Styfen Schär","Stefano Marelli","Bruno Sudret"],"pdf_url":"https://arxiv.org/pdf/2306.16335v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08367v1","updated":"2023-10-12T14:38:25Z","published":"2023-10-12T14:38:25Z","title":"MCU: A Task-centric Framework for Open-ended Agent Evaluation in\n Minecraft","summary":" To pursue the goal of creating an open-ended agent in Minecraft, an\nopen-ended game environment with unlimited possibilities, this paper introduces\na task-centric framework named MCU for Minecraft agent evaluation. 
The MCU\nframework leverages the concept of atom tasks as fundamental building blocks,\nenabling the generation of diverse or even arbitrary tasks. Within the MCU\nframework, each task is measured with six distinct difficulty scores (time\nconsumption, operational effort, planning complexity, intricacy, creativity,\nnovelty). These scores offer a multi-dimensional assessment of a task from\ndifferent angles, and thus can reveal an agent's capability on specific facets.\nThe difficulty scores also serve as the feature of each task, which creates a\nmeaningful task space and unveils the relationship between tasks. For efficient\nevaluation of Minecraft agents employing the MCU framework, we maintain a\nunified benchmark, namely SkillForge, which comprises representative tasks with\ndiverse categories and difficulty distribution. We also provide convenient\nfilters for users to select tasks to assess specific capabilities of agents. We\nshow that MCU has the high expressivity to cover all tasks used in recent\nliterature on Minecraft agent, and underscores the need for advancements in\nareas such as creativity, precise control, and out-of-distribution\ngeneralization under the goal of open-ended Minecraft agent development.\n","authors":["Haowei Lin","Zihao Wang","Jianzhu Ma","Yitao Liang"],"pdf_url":"https://arxiv.org/pdf/2310.08367v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08358v1","updated":"2023-10-12T14:29:02Z","published":"2023-10-12T14:29:02Z","title":"Towards Demystifying the Generalization Behaviors When Neural Collapse\n Emerges","summary":" Neural Collapse (NC) is a well-known phenomenon of deep neural networks in\nthe terminal phase of training (TPT). It is characterized by the collapse of\nfeatures and classifier into a symmetrical structure, known as simplex\nequiangular tight frame (ETF). While there have been extensive studies on\noptimization characteristics showing the global optimality of neural collapse,\nlittle research has been done on the generalization behaviors during the\noccurrence of NC. Particularly, the important phenomenon of generalization\nimprovement during TPT has been remaining in an empirical observation and\nlacking rigorous theoretical explanation. In this paper, we establish the\nconnection between the minimization of CE and a multi-class SVM during TPT, and\nthen derive a multi-class margin generalization bound, which provides a\ntheoretical explanation for why continuing training can still lead to accuracy\nimprovement on test set, even after the train accuracy has reached 100%.\nAdditionally, our further theoretical results indicate that different alignment\nbetween labels and features in a simplex ETF can result in varying degrees of\ngeneralization improvement, despite all models reaching NC and demonstrating\nsimilar optimization performance on train set. We refer to this newly\ndiscovered property as \"non-conservative generalization\". In experiments, we\nalso provide empirical observations to verify the indications suggested by our\ntheoretical results.\n","authors":["Peifeng Gao","Qianqian Xu","Yibo Yang","Peisong Wen","Huiyang Shao","Zhiyong Yang","Bernard Ghanem","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08358v1.pdf","comment":"20 pages, 6 figures. 
arXiv admin note: substantial text overlap with\n arXiv:2304.08914"},{"id":"http://arxiv.org/abs/2310.06970v2","updated":"2023-10-12T14:21:52Z","published":"2023-10-10T19:47:58Z","title":"Flood and Echo: Algorithmic Alignment of GNNs with Distributed Computing","summary":" Graph Neural Networks are a natural fit for learning algorithms. They can\ndirectly represent tasks through an abstract but versatile graph structure and\nhandle inputs of different sizes. This opens up the possibility for scaling and\nextrapolation to larger graphs, one of the most important advantages of an\nalgorithm. However, this raises two core questions: i) How can we enable nodes\nto gather the required information in a given graph ($\\textit{information\nexchange}$), even if it is far away and ii) How can we design an execution\nframework which enables this information exchange for extrapolation to larger\ngraph sizes ($\\textit{algorithmic alignment for extrapolation}$). We propose a\nnew execution framework that is inspired by the design principles of\ndistributed algorithms: Flood and Echo Net. It propagates messages through the\nentire graph in a wave-like activation pattern, which naturally generalizes to\nlarger instances. Through its sparse but parallel activations it is provably\nmore efficient in terms of message complexity. We study the proposed model and\nprovide both empirical evidence and theoretical insights in terms of its\nexpressiveness, efficiency, information exchange and ability to extrapolate.\n","authors":["Joël Mathys","Florian Grötschla","Kalyan Varma Nadimpalli","Roger Wattenhofer"],"pdf_url":"https://arxiv.org/pdf/2310.06970v2.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2310.08348v1","updated":"2023-10-12T14:18:09Z","published":"2023-10-12T14:18:09Z","title":"LightZero: A Unified Benchmark for Monte Carlo Tree Search in General\n Sequential Decision Scenarios","summary":" Building agents based on tree-search planning capabilities with learned\nmodels has achieved remarkable success in classic decision-making problems,\nsuch as Go and Atari. However, it has been deemed challenging or even\ninfeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse\nreal-world applications, especially when these environments involve complex\naction spaces and significant simulation costs, or inherent stochasticity. In\nthis work, we introduce LightZero, the first unified benchmark for deploying\nMCTS/MuZero in general sequential decision scenarios. Specifically, we\nsummarize the most critical challenges in designing a general MCTS-style\ndecision-making solver, then decompose the tightly-coupled algorithm and system\ndesign of tree-search RL methods into distinct sub-modules. By incorporating\nmore appropriate exploration and optimization strategies, we can significantly\nenhance these sub-modules and construct powerful LightZero agents to tackle\ntasks across a wide range of domains, such as board games, Atari, MuJoCo,\nMiniGrid and GoBigger. Detailed benchmark results reveal the significant\npotential of such methods in building scalable and efficient decision\nintelligence. 
The code is available as part of OpenDILab at\nhttps://github.com/opendilab/LightZero.\n","authors":["Yazhe Niu","Yuan Pu","Zhenjie Yang","Xueyan Li","Tong Zhou","Jiyuan Ren","Shuai Hu","Hongsheng Li","Yu Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08348v1.pdf","comment":"NeurIPS 2023 Spotlight"},{"id":"http://arxiv.org/abs/2306.14041v2","updated":"2023-10-12T14:16:15Z","published":"2023-06-24T19:22:01Z","title":"Smoothed $f$-Divergence Distributionally Robust Optimization","summary":" In data-driven optimization, sample average approximation (SAA) is known to\nsuffer from the so-called optimizer's curse that causes an over-optimistic\nevaluation of the solution performance. We argue that a special type of\ndistributionally robust optimization (DRO) formulation offers theoretical\nadvantages in correcting for this optimizer's curse compared to simple\n``margin'' adjustments to SAA and other DRO approaches: It attains a\nstatistical bound on the out-of-sample performance, for a wide class of\nobjective functions and distributions, that is nearly tightest in terms of\nexponential decay rate. This DRO uses an ambiguity set based on a Kullback-Leibler\n(KL) divergence smoothed by the Wasserstein or L\\'evy-Prokhorov (LP)\ndistance via a suitable distance optimization. Computationally, we also show\nthat such a DRO, and its generalized versions using smoothed $f$-divergence,\nare not harder than DRO problems based on $f$-divergence or Wasserstein\ndistances, rendering our DRO formulations both statistically optimal and\ncomputationally viable.\n","authors":["Zhenyuan Liu","Bart P. G. Van Parys","Henry Lam"],"pdf_url":"https://arxiv.org/pdf/2306.14041v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07579v2","updated":"2023-10-12T14:15:24Z","published":"2023-10-11T15:19:31Z","title":"In-Context Unlearning: Language Models as Few Shot Unlearners","summary":" Machine unlearning, the study of efficiently removing the impact of specific\ntraining points on the trained model, has garnered increased attention of late,\ndriven by the need to comply with privacy regulations like the Right to be\nForgotten. Although unlearning is particularly relevant for LLMs in light of\nthe copyright issues they raise, achieving precise unlearning is\ncomputationally infeasible for very large models. To this end, recent work has\nproposed several algorithms which approximate the removal of training data\nwithout retraining the model. These algorithms crucially rely on access to the\nmodel parameters in order to update them, an assumption that may not hold in\npractice due to computational constraints or when the LLM is accessed via API.\nIn this work, we propose a new class of unlearning methods for LLMs we call\n''In-Context Unlearning'', providing inputs in context and without having to\nupdate model parameters. To unlearn a particular training instance, we provide\nthe instance alongside a flipped label and additional correctly labelled\ninstances which are prepended as inputs to the LLM at inference time. 
Our\nexperimental results demonstrate that these contexts effectively remove\nspecific information from the training set while maintaining performance levels\nthat are competitive with (or in some cases exceed) state-of-the-art unlearning\nmethods that require access to the LLM parameters.\n","authors":["Martin Pawelczyk","Seth Neel","Himabindu Lakkaraju"],"pdf_url":"https://arxiv.org/pdf/2310.07579v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.00152v4","updated":"2023-10-12T14:05:03Z","published":"2023-04-29T02:27:42Z","title":"Limits of Model Selection under Transfer Learning","summary":" Theoretical studies on transfer learning or domain adaptation have so far\nfocused on situations with a known hypothesis class or model; however in\npractice, some amount of model selection is usually involved, often appearing\nunder the umbrella term of hyperparameter-tuning: for example, one may think of\nthe problem of tuning for the right neural network architecture towards a\ntarget task, while leveraging data from a related source task.\n Now, in addition to the usual tradeoffs on approximation vs estimation errors\ninvolved in model selection, this problem brings in a new complexity term,\nnamely, the transfer distance between source and target distributions, which is\nknown to vary with the choice of hypothesis class.\n We present a first study of this problem, focusing on classification; in\nparticular, the analysis reveals some remarkable phenomena: adaptive rates,\ni.e., those achievable with no distributional information, can be arbitrarily\nslower than oracle rates, i.e., when given knowledge on distances.\n","authors":["Steve Hanneke","Samory Kpotufe","Yasaman Mahdaviyeh"],"pdf_url":"https://arxiv.org/pdf/2305.00152v4.pdf","comment":"Accepted for presentation at the Conference on Learning Theory (COLT)\n 2023"},{"id":"http://arxiv.org/abs/2310.08339v1","updated":"2023-10-12T13:57:32Z","published":"2023-10-12T13:57:32Z","title":"A Generic Software Framework for Distributed Topological Analysis\n Pipelines","summary":" This system paper presents a software framework for the support of\ntopological analysis pipelines in a distributed-memory model. While several\nrecent papers introduced topology-based approaches for distributed-memory\nenvironments, these were reporting experiments obtained with tailored,\nmono-algorithm implementations. In contrast, we describe in this paper a\ngeneral-purpose, generic framework for topological analysis pipelines, i.e. a\nsequence of topological algorithms interacting together, possibly on distinct\nnumbers of processes. Specifically, we instantiated our framework with the MPI\nmodel, within the Topology ToolKit (TTK). While developing this framework, we\nfaced several algorithmic and software engineering challenges, which we\ndocument in this paper. We provide a taxonomy for the distributed-memory\ntopological algorithms supported by TTK, depending on their communication needs\nand provide examples of hybrid MPI+thread parallelizations. Detailed\nperformance analyses show that parallel efficiencies range from $20\\%$ to\n$80\\%$ (depending on the algorithms), and that the MPI-specific preconditioning\nintroduced by our framework induces a negligible computation time overhead. 
We\nillustrate the new distributed-memory capabilities of TTK with an example of\nadvanced analysis pipeline, combining multiple algorithms, run on the largest\npublicly available dataset we have found (120 billion vertices) on a standard\ncluster with 64 nodes (for a total of 1,536 cores). Finally, we provide a\nroadmap for the completion of TTK's MPI extension, along with generic\nrecommendations for each algorithm communication category.\n","authors":["Eve Le Guillou","Michael Will","Pierre Guillou","Jonas Lukasczyk","Pierre Fortin","Christoph Garth","Julien Tierny"],"pdf_url":"https://arxiv.org/pdf/2310.08339v1.pdf","comment":"18 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.08337v1","updated":"2023-10-12T13:54:55Z","published":"2023-10-12T13:54:55Z","title":"Neural Diffusion Models","summary":" Diffusion models have shown remarkable performance on many generative tasks.\nDespite recent success, most diffusion models are restricted in that they only\nallow linear transformation of the data distribution. In contrast, broader\nfamily of transformations can potentially help train generative distributions\nmore efficiently, simplifying the reverse process and closing the gap between\nthe true negative log-likelihood and the variational approximation. In this\npaper, we present Neural Diffusion Models (NDMs), a generalization of\nconventional diffusion models that enables defining and learning time-dependent\nnon-linear transformations of data. We show how to optimise NDMs using a\nvariational bound in a simulation-free setting. Moreover, we derive a\ntime-continuous formulation of NDMs, which allows fast and reliable inference\nusing off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the\nutility of NDMs with learnable transformations through experiments on standard\nimage generation benchmarks, including CIFAR-10, downsampled versions of\nImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms\nof likelihood and produce high-quality samples.\n","authors":["Grigory Bartosh","Dmitry Vetrov","Christian A. Naesseth"],"pdf_url":"https://arxiv.org/pdf/2310.08337v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08331v1","updated":"2023-10-12T13:45:33Z","published":"2023-10-12T13:45:33Z","title":"Impact of multi-armed bandit strategies on deep recurrent reinforcement\n learning","summary":" Incomplete knowledge of the environment leads an agent to make decisions\nunder uncertainty. One of the major dilemmas in Reinforcement Learning (RL)\nwhere an autonomous agent has to balance two contrasting needs in making its\ndecisions is: exploiting the current knowledge of the environment to maximize\nthe cumulative reward as well as exploring actions that allow improving the\nknowledge of the environment, hopefully leading to higher reward values\n(exploration-exploitation trade-off). Concurrently, another relevant issue\nregards the full observability of the states, which may not be assumed in all\napplications. Such as when only 2D images are considered as input in a RL\napproach used for finding the optimal action within a 3D simulation\nenvironment. In this work, we address these issues by deploying and testing\nseveral techniques to balance exploration and exploitation trade-off on\npartially observable systems for predicting steering wheels in autonomous\ndriving scenario. 
More precisely, the final aim is to investigate the effects\nof using both stochastic and deterministic multi-armed bandit strategies\ncoupled with a Deep Recurrent Q-Network. Additionally, we adapted and evaluated\nthe impact of an innovative method to improve the learning phase of the\nunderlying Convolutional Recurrent Neural Network. We aim to show that adaptive\nstochastic methods for exploration better approximate the trade-off between\nexploration and exploitation as, in general, Softmax and Max-Boltzmann\nstrategies are able to outperform epsilon-greedy techniques.\n","authors":["Valentina Zangirolami","Matteo Borrotti"],"pdf_url":"https://arxiv.org/pdf/2310.08331v1.pdf","comment":"26 pages"},{"id":"http://arxiv.org/abs/2309.03004v2","updated":"2023-10-12T13:36:41Z","published":"2023-09-06T13:48:40Z","title":"A Theoretical Explanation of Activation Sparsity through Flat Minima and\n Adversarial Robustness","summary":" A recent empirical observation (Li et al., 2022b) of activation sparsity in\nMLP blocks offers an opportunity to drastically reduce computation costs for\nfree. Although having attributed it to training dynamics, existing theoretical\nexplanations of activation sparsity are restricted to shallow networks, small\ntraining steps and special training, despite its emergence in deep models\nstandardly trained for a large number of steps. To fill these gaps, we propose\nthe notion of gradient sparsity as one source of activation sparsity and a\ntheoretical explanation based on it that sees sparsity a necessary step to\nadversarial robustness w.r.t. hidden features and parameters, which is\napproximately the flatness of minima for well-learned models. The theory\napplies to standardly trained LayerNorm-ed MLPs, and further to Transformers or\nother architectures trained with weight noises. Eliminating other sources of\nflatness except for sparsity, we discover the phenomenon that the ratio between\nthe largest and smallest non-zero singular values of weight matrices is small.\nWhen discussing the emergence of this spectral concentration, we use random\nmatrix theory (RMT) as a powerful tool to analyze stochastic gradient noises.\nValidational experiments are conducted to verify our gradient-sparsity-based\nexplanation. We propose two plug-and-play modules for both training and\nfinetuning for sparsity. Experiments on ImageNet-1k and C4 demonstrate their\n50% sparsity improvements, indicating further potential cost reduction in both\ntraining and inference.\n","authors":["Ze Peng","Lei Qi","Yinghuan Shi","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2309.03004v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08320v1","updated":"2023-10-12T13:33:04Z","published":"2023-10-12T13:33:04Z","title":"Defending Our Privacy With Backdoors","summary":" The proliferation of large AI models trained on uncurated, often sensitive\nweb-scraped data has raised significant privacy concerns. One of the concerns\nis that adversaries can extract information about the training data using\nprivacy attacks. Unfortunately, the task of removing specific information from\nthe models without sacrificing performance is not straightforward and has\nproven to be challenging. We propose a rather easy yet effective defense based\non backdoor attacks to remove private information such as names of individuals\nfrom models, and focus in this work on text encoders. 
Specifically, through\nstrategic insertion of backdoors, we align the embeddings of sensitive phrases\nwith those of neutral terms, such as \"a person\" instead of the person's name. Our\nempirical results demonstrate the effectiveness of our backdoor-based defense\non CLIP by assessing its performance using a specialized privacy attack for\nzero-shot classifiers. Our approach provides not only a new \"dual-use\"\nperspective on backdoor attacks, but also presents a promising avenue to\nenhance the privacy of individuals within models trained on uncurated\nweb-scraped data.\n","authors":["Dominik Hintersdorf","Lukas Struppek","Daniel Neider","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2310.08320v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.08312v1","updated":"2023-10-12T13:20:17Z","published":"2023-10-12T13:20:17Z","title":"GePSAn: Generative Procedure Step Anticipation in Cooking Videos","summary":" We study the problem of future step anticipation in procedural videos. Given\na video of an ongoing procedural activity, we predict a plausible next\nprocedure step described in rich natural language. While most previous work\nfocuses on the problem of data scarcity in procedural video datasets, another\ncore challenge of future anticipation is how to account for multiple plausible\nfuture realizations in natural settings. This problem has been largely\noverlooked in previous work. To address this challenge, we frame future step\nprediction as modelling the distribution of all possible candidates for the\nnext step. Specifically, we design a generative model that takes a series of\nvideo clips as input, and generates multiple plausible and diverse candidates\n(in natural language) for the next step. Following previous work, we side-step\nthe video annotation scarcity by pretraining our model on a large text-based\ncorpus of procedural activities, and then transfer the model to the video\ndomain. Our experiments, both in textual and video domains, show that our model\ncaptures diversity in the next step prediction and generates multiple plausible\nfuture predictions. Moreover, our model establishes new state-of-the-art\nresults on YouCookII, where it outperforms existing baselines on the next step\nanticipation. Finally, we also show that our model can successfully transfer\nfrom text to the video domain zero-shot, i.e., without fine-tuning or adaptation,\nand produces good-quality future step predictions from video.\n","authors":["Mohamed Ashraf Abdelsalam","Samrudhdhi B. Rangrej","Isma Hadji","Nikita Dvornik","Konstantinos G. Derpanis","Afsaneh Fazly"],"pdf_url":"https://arxiv.org/pdf/2310.08312v1.pdf","comment":"published at ICCV 2023"},{"id":"http://arxiv.org/abs/2310.08304v1","updated":"2023-10-12T13:11:38Z","published":"2023-10-12T13:11:38Z","title":"CHIP: Contrastive Hierarchical Image Pretraining","summary":" Few-shot object classification is the task of classifying objects in an image\nwith a limited number of examples as supervision. We propose a one-shot/few-shot\nclassification model that can classify an object of any unseen class into a\nrelatively general category in a hierarchically based classification. Our\nmodel uses a three-level hierarchical contrastive loss based ResNet152\nclassifier for classifying an object based on its features extracted from image\nembedding, not used during the training phase. 
For our experimentation, we have\nused a subset of the ImageNet (ILSVRC-12) dataset that contains only the animal\nclasses for training our model and created our own dataset of unseen classes\nfor evaluating our trained model. Our model provides satisfactory results in\nclassifying the unknown objects into a generic category which has been later\ndiscussed in greater detail.\n","authors":["Arpit Mittal","Harshil Jhaveri","Swapnil Mallick","Abhishek Ajmera"],"pdf_url":"https://arxiv.org/pdf/2310.08304v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.07189v3","updated":"2023-10-12T12:47:22Z","published":"2023-03-13T15:30:28Z","title":"Optimizing Convolutional Neural Networks for Chronic Obstructive\n Pulmonary Disease Detection in Clinical Computed Tomography Imaging","summary":" We aim to optimize the binary detection of Chronic Obstructive Pulmonary\nDisease (COPD) based on emphysema presence in the lung with convolutional\nneural networks (CNN) by exploring manually adjusted versus automated\nwindow-setting optimization (WSO) on computed tomography (CT) images. 7,194 CT\nimages (3,597 with COPD; 3,597 healthy controls) from 78 subjects (43 with\nCOPD; 35 healthy controls) were selected retrospectively (10.2018-12.2019) and\npreprocessed. For each image, intensity values were manually clipped to the\nemphysema window setting and a baseline 'full-range' window setting.\nClass-balanced train, validation, and test sets contained 3,392, 1,114, and\n2,688 images. The network backbone was optimized by comparing various CNN\narchitectures. Furthermore, automated WSO was implemented by adding a\ncustomized layer to the model. The image-level area under the Receiver\nOperating Characteristics curve (AUC) [lower, upper limit 95% confidence] was\nutilized to compare model variations. Repeated inference (n=7) on the test set\nshowed that the DenseNet was the most efficient backbone and achieved a mean\nAUC of 0.80 [0.76, 0.85] without WSO. Comparably, with input images manually\nadjusted to the emphysema window, the DenseNet model predicted COPD with a mean\nAUC of 0.86 [0.82, 0.89]. By adding a customized WSO layer to the DenseNet, an\noptimal window in the proximity of the emphysema window setting was learned\nautomatically, and a mean AUC of 0.82 [0.78, 0.86] was achieved. Detection of\nCOPD with DenseNet models was improved by WSO of CT data to the emphysema\nwindow setting range.\n","authors":["Tina Dorosti","Manuel Schultheiss","Felix Hofmann","Johannes Thalhammer","Luisa Kirchner","Theresa Urban","Franz Pfeiffer","Florian Schaff","Tobias Lasser","Daniela Pfeiffer"],"pdf_url":"https://arxiv.org/pdf/2303.07189v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08287v1","updated":"2023-10-12T12:45:13Z","published":"2023-10-12T12:45:13Z","title":"A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors","summary":" The distribution of the weights of modern deep neural networks (DNNs) -\ncrucial for uncertainty quantification and robustness - is an eminently complex\nobject due to its extremely high dimensionality. This paper proposes one of the\nfirst large-scale explorations of the posterior distribution of deep Bayesian\nNeural Networks (BNNs), expanding its study to real-world vision tasks and\narchitectures. Specifically, we investigate the optimal approach for\napproximating the posterior, analyze the connection between posterior quality\nand uncertainty quantification, delve into the impact of modes on the\nposterior, and explore methods for visualizing the posterior. 
Moreover, we\nuncover weight-space symmetries as a critical aspect for understanding the\nposterior. To this extent, we develop an in-depth assessment of the impact of\nboth permutation and scaling symmetries that tend to obfuscate the Bayesian\nposterior. While the first type of transformation is known for duplicating\nmodes, we explore the relationship between the latter and L2 regularization,\nchallenging previous misconceptions. Finally, to help the community improve our\nunderstanding of the Bayesian posterior, we will shortly release the first\nlarge-scale checkpoint dataset, including thousands of real-world models and\nour codes.\n","authors":["Olivier Laurent","Emanuel Aldea","Gianni Franchi"],"pdf_url":"https://arxiv.org/pdf/2310.08287v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08282v1","updated":"2023-10-12T12:39:08Z","published":"2023-10-12T12:39:08Z","title":"Data driven modeling of self-similar dynamics","summary":" Multiscale modeling of complex systems is crucial for understanding their\nintricacies. Data-driven multiscale modeling has emerged as a promising\napproach to tackle challenges associated with complex systems. On the other\nhand, self-similarity is prevalent in complex systems, hinting that large-scale\ncomplex systems can be modeled at a reduced cost. In this paper, we introduce a\nmultiscale neural network framework that incorporates self-similarity as prior\nknowledge, facilitating the modeling of self-similar dynamical systems. For\ndeterministic dynamics, our framework can discern whether the dynamics are\nself-similar. For uncertain dynamics, it can compare and determine which\nparameter set is closer to self-similarity. The framework allows us to extract\nscale-invariant kernels from the dynamics for modeling at any scale. Moreover,\nour method can identify the power law exponents in self-similar systems.\nPreliminary tests on the Ising model yielded critical exponents consistent with\ntheoretical expectations, providing valuable insights for addressing critical\nphase transitions in non-equilibrium systems.\n","authors":["Ruyi Tao","Ningning Tao","Yizhuang You","Jiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08282v1.pdf","comment":"10 pages,4 figures,1 table"},{"id":"http://arxiv.org/abs/2310.08278v1","updated":"2023-10-12T12:29:32Z","published":"2023-10-12T12:29:32Z","title":"Lag-Llama: Towards Foundation Models for Time Series Forecasting","summary":" Aiming to build foundation models for time-series forecasting and study their\nscaling behavior, we present here our work-in-progress on Lag-Llama, a\ngeneral-purpose univariate probabilistic time-series forecasting model trained\non a large collection of time-series data. The model shows good zero-shot\nprediction capabilities on unseen \"out-of-distribution\" time-series datasets,\noutperforming supervised baselines. We use smoothly broken power-laws to fit\nand predict model scaling behavior. 
The open source code is made available at\nhttps://github.com/kashif/pytorch-transformer-ts.\n","authors":["Kashif Rasul","Arjun Ashok","Andrew Robert Williams","Arian Khorasani","George Adamopoulos","Rishika Bhagwatkar","Marin Biloš","Hena Ghonia","Nadhir Vincent Hassen","Anderson Schneider","Sahil Garg","Alexandre Drouin","Nicolas Chapados","Yuriy Nevmyvaka","Irina Rish"],"pdf_url":"https://arxiv.org/pdf/2310.08278v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15394v2","updated":"2023-10-12T12:11:23Z","published":"2023-05-24T17:56:18Z","title":"Differentially-Private Decision Trees and Provable Robustness to Data\n Poisoning","summary":" Decision trees are interpretable models that are well-suited to non-linear\nlearning problems. Much work has been done on extending decision tree learning\nalgorithms with differential privacy, a system that guarantees the privacy of\nsamples within the training data. However, current state-of-the-art algorithms\nfor this purpose sacrifice much utility for a small privacy benefit. These\nsolutions create random decision nodes that reduce decision tree accuracy or\nspend an excessive share of the privacy budget on labeling leaves. Moreover,\nmany works do not support continuous features or leak information about them.\nWe propose a new method called PrivaTree based on private histograms that\nchooses good splits while consuming a small privacy budget. The resulting trees\nprovide a significantly better privacy-utility trade-off and accept mixed\nnumerical and categorical data without leaking information about numerical\nfeatures. Finally, while it is notoriously hard to give robustness guarantees\nagainst data poisoning attacks, we demonstrate bounds for the expected accuracy\nand success rates of backdoor attacks against differentially-private learners.\nBy leveraging the better privacy-utility trade-off of PrivaTree we are able to\ntrain decision trees with significantly better robustness against backdoor\nattacks compared to regular decision trees and with meaningful theoretical\nguarantees.\n","authors":["Daniël Vos","Jelle Vos","Tianyu Li","Zekeriya Erkin","Sicco Verwer"],"pdf_url":"https://arxiv.org/pdf/2305.15394v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08259v1","updated":"2023-10-12T12:05:51Z","published":"2023-10-12T12:05:51Z","title":"Invisible Threats: Backdoor Attack in OCR Systems","summary":" Optical Character Recognition (OCR) is a widely used tool to extract text\nfrom scanned documents. Today, the state-of-the-art is achieved by exploiting\ndeep neural networks. However, the cost of this performance is paid at the\nprice of system vulnerability. For instance, in backdoor attacks, attackers\ncompromise the training phase by inserting a backdoor in the victim's model\nthat will be activated at testing time by specific patterns while leaving the\noverall model performance intact. This work proposes a backdoor attack for OCR\nresulting in the injection of non-readable characters from malicious input\nimages. 
This simple but effective attack exposes the state-of-the-art OCR\nweakness, making the extracted text correct to human eyes but simultaneously\nunusable for the NLP application that uses OCR as a preprocessing step.\nExperimental results show that the attacked models successfully output\nnon-readable characters for around 90% of the poisoned instances without\nharming their performance for the remaining instances.\n","authors":["Mauro Conti","Nicola Farronato","Stefanos Koffas","Luca Pajola","Stjepan Picek"],"pdf_url":"https://arxiv.org/pdf/2310.08259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19443v2","updated":"2023-10-12T12:02:34Z","published":"2023-05-30T22:34:48Z","title":"OWAdapt: An adaptive loss function for deep learning using OWA operators","summary":" In this paper, we propose a fuzzy adaptive loss function for enhancing deep\nlearning performance in classification tasks. Specifically, we redefine the\ncross-entropy loss to effectively address class-level noise conditions,\nincluding the challenging problem of class imbalance. Our approach introduces\naggregation operators, leveraging the power of fuzzy logic to improve\nclassification accuracy. The rationale behind our proposed method lies in the\niterative up-weighting of class-level components within the loss function,\nfocusing on those with larger errors. To achieve this, we employ the ordered\nweighted average (OWA) operator and combine it with an adaptive scheme for\ngradient-based learning. Through extensive experimentation, our method\noutperforms other commonly used loss functions, such as the standard\ncross-entropy or focal loss, across various binary and multiclass\nclassification tasks. Furthermore, we explore the influence of hyperparameters\nassociated with the OWA operators and present a default configuration that\nperforms well across different experimental settings.\n","authors":["Sebastián Maldonado","Carla Vairetti","Katherine Jara","Miguel Carrasco","Julio López"],"pdf_url":"https://arxiv.org/pdf/2305.19443v2.pdf","comment":"15 pages, 1 figure, published"},{"id":"http://arxiv.org/abs/2310.08256v1","updated":"2023-10-12T12:01:32Z","published":"2023-10-12T12:01:32Z","title":"Impact of Co-occurrence on Factual Knowledge of Large Language Models","summary":" Large language models (LLMs) often make factually incorrect responses despite\ntheir success in various applications. In this paper, we hypothesize that\nrelying heavily on simple co-occurrence statistics of the pre-training corpora\nis one of the main factors that cause factual errors. Our results reveal that\nLLMs are vulnerable to the co-occurrence bias, defined as preferring frequently\nco-occurred words over the correct answer. Consequently, LLMs struggle to\nrecall facts whose subject and object rarely co-occur in the pre-training\ndataset although they are seen during finetuning. We show that co-occurrence\nbias remains despite scaling up model sizes or finetuning. Therefore, we\nsuggest finetuning on a debiased dataset to mitigate the bias by filtering out\nbiased samples whose subject-object co-occurrence count is high. Although\ndebiased finetuning allows LLMs to memorize rare facts in the training set, it\nis not effective in recalling rare facts unseen during finetuning. Further\nresearch in mitigation will help build reliable language models by preventing\npotential errors. 
The code is available at\n\\url{https://github.com/CheongWoong/impact_of_cooccurrence}.\n","authors":["Cheongwoong Kang","Jaesik Choi"],"pdf_url":"https://arxiv.org/pdf/2310.08256v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2304.12233v3","updated":"2023-10-12T12:00:09Z","published":"2023-04-20T10:45:57Z","title":"Diffusion-based Generative AI for Exploring Transition States from 2D\n Molecular Graphs","summary":" The exploration of transition state (TS) geometries is crucial for\nelucidating chemical reaction mechanisms and modeling their kinetics. Recently,\nmachine learning (ML) models have shown remarkable performance for prediction\nof TS geometries. However, they require 3D conformations of reactants and\nproducts often with their appropriate orientations as input, which demands\nsubstantial efforts and computational cost. Here, we propose a generative\napproach based on the stochastic diffusion method, namely TSDiff, for\nprediction of TS geometries just from 2D molecular graphs. TSDiff outperformed\nthe existing ML models with 3D geometries in terms of both accuracy and\nefficiency. Moreover, it enables to sample various TS conformations, because it\nlearned the distribution of TS geometries for diverse reactions in training.\nThus, TSDiff was able to find more favorable reaction pathways with lower\nbarrier heights than those in the reference database. These results demonstrate\nthat TSDiff shows promising potential for an efficient and reliable TS\nexploration.\n","authors":["Seonghwan Kim","Jeheon Woo","Woo Youn Kim"],"pdf_url":"https://arxiv.org/pdf/2304.12233v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.00327v2","updated":"2023-10-12T11:55:28Z","published":"2023-09-30T10:06:05Z","title":"Memorization with neural nets: going beyond the worst case","summary":" In practice, deep neural networks are often able to easily interpolate their\ntraining data. To understand this phenomenon, many works have aimed to quantify\nthe memorization capacity of a neural network architecture: the largest number\nof points such that the architecture can interpolate any placement of these\npoints with any assignment of labels. For real-world data, however, one\nintuitively expects the presence of a benign structure so that interpolation\nalready occurs at a smaller network size than suggested by memorization\ncapacity. In this paper, we investigate interpolation by adopting an\ninstance-specific viewpoint. We introduce a simple randomized algorithm that,\ngiven a fixed finite dataset with two classes, with high probability constructs\nan interpolating three-layer neural network in polynomial time. The required\nnumber of parameters is linked to geometric properties of the two classes and\ntheir mutual arrangement. As a result, we obtain guarantees that are\nindependent of the number of samples and hence move beyond worst-case\nmemorization capacity bounds. 
We illustrate the effectiveness of the algorithm\nin non-pathological situations with extensive numerical experiments and link\nthe insights back to the theoretical results.\n","authors":["Sjoerd Dirksen","Patrick Finke","Martin Genzel"],"pdf_url":"https://arxiv.org/pdf/2310.00327v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08252v1","updated":"2023-10-12T11:55:17Z","published":"2023-10-12T11:55:17Z","title":"MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with\n Reinforcement Learning","summary":" Recently, Meta-Black-Box Optimization with Reinforcement Learning\n(MetaBBO-RL) has showcased the power of leveraging RL at the meta-level to\nmitigate manual fine-tuning of low-level black-box optimizers. However, this\nfield is hindered by the lack of a unified benchmark. To fill this gap, we\nintroduce MetaBox, the first benchmark platform expressly tailored for\ndeveloping and evaluating MetaBBO-RL methods. MetaBox offers a flexible\nalgorithmic template that allows users to effortlessly implement their unique\ndesigns within the platform. Moreover, it provides a broad spectrum of over 300\nproblem instances, collected from synthetic to realistic scenarios, and an\nextensive library of 19 baseline methods, including both traditional black-box\noptimizers and recent MetaBBO-RL methods. Besides, MetaBox introduces three\nstandardized performance metrics, enabling a more thorough assessment of the\nmethods. In a bid to illustrate the utility of MetaBox for facilitating\nrigorous evaluation and in-depth analysis, we carry out a wide-ranging\nbenchmarking study on existing MetaBBO-RL methods. Our MetaBox is open-source\nand accessible at: https://github.com/GMC-DRL/MetaBox.\n","authors":["Zeyuan Ma","Hongshu Guo","Jiacheng Chen","Zhenrui Li","Guojun Peng","Yue-Jiao Gong","Yining Ma","Zhiguang Cao"],"pdf_url":"https://arxiv.org/pdf/2310.08252v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2110.03469v3","updated":"2023-10-12T11:53:59Z","published":"2021-10-07T13:49:23Z","title":"Federated Learning from Small Datasets","summary":" Federated learning allows multiple parties to collaboratively train a joint\nmodel without sharing local data. This enables applications of machine learning\nin settings of inherently distributed, undisclosable data such as in the\nmedical domain. In practice, joint training is usually achieved by aggregating\nlocal models, for which local training objectives have to be in expectation\nsimilar to the joint (global) objective. Often, however, local datasets are so\nsmall that local objectives differ greatly from the global objective, resulting\nin the failure of federated learning. We propose a novel approach that intertwines\nmodel aggregations with permutations of local models. The permutations expose\neach local model to a daisy chain of local datasets, resulting in more efficient\ntraining in data-sparse domains. 
This enables training on extremely small local\ndatasets, such as patient data across hospitals, while retaining the training\nefficiency and privacy benefits of federated learning.\n","authors":["Michael Kamp","Jonas Fischer","Jilles Vreeken"],"pdf_url":"https://arxiv.org/pdf/2110.03469v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04263v3","updated":"2023-10-12T11:42:59Z","published":"2023-08-08T13:59:56Z","title":"BarlowRL: Barlow Twins for Data-Efficient Reinforcement Learning","summary":" This paper introduces BarlowRL, a data-efficient reinforcement learning agent\nthat combines the Barlow Twins self-supervised learning framework with DER\n(Data-Efficient Rainbow) algorithm. BarlowRL outperforms both DER and its\ncontrastive counterpart CURL on the Atari 100k benchmark. BarlowRL avoids\ndimensional collapse by enforcing information spread to the whole space. This\nhelps RL algorithms to utilize uniformly spread state representation that\neventually results in a remarkable performance. The integration of Barlow Twins\nwith DER enhances data efficiency and achieves superior performance in the RL\ntasks. BarlowRL demonstrates the potential of incorporating self-supervised\nlearning techniques to improve RL algorithms.\n","authors":["Omer Veysel Cagatan","Baris Akgun"],"pdf_url":"https://arxiv.org/pdf/2308.04263v3.pdf","comment":"ACML 2023, Camera-Ready Version"},{"id":"http://arxiv.org/abs/2304.00457v3","updated":"2023-10-12T11:35:35Z","published":"2023-04-02T05:47:09Z","title":"LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language\n Models","summary":" Large Language Models (LLMs) have revolutionized natural language processing\nand demonstrated impressive capabilities in various tasks. Unfortunately, they\nare prone to hallucinations, where the model exposes incorrect or false\ninformation in its responses, which renders diligent evaluation approaches\nmandatory. While LLM performance in specific knowledge fields is often\nevaluated based on question and answer (Q&A) datasets, such evaluations usually\nreport only a single accuracy number for the dataset, which often covers an\nentire field. This field-based evaluation, is problematic with respect to\ntransparency and model improvement. A stratified evaluation could instead\nreveal subfields, where hallucinations are more likely to occur and thus help\nto better assess LLMs' risks and guide their further development. To support\nsuch stratified evaluations, we propose LLMMaps as a novel visualization\ntechnique that enables users to evaluate LLMs' performance with respect to Q&A\ndatasets. LLMMaps provide detailed insights into LLMs' knowledge capabilities\nin different subfields, by transforming Q&A datasets as well as LLM responses\ninto an internal knowledge structure. An extension for comparative\nvisualization furthermore, allows for the detailed comparison of multiple LLMs.\nTo assess LLMMaps we use them to conduct a comparative analysis of several\nstate-of-the-art LLMs, such as BLOOM, GPT-2, GPT-3, ChatGPT and LLaMa-13B, as\nwell as two qualitative user evaluations. 
All necessary source code and data\nfor generating LLMMaps, to be used in scientific publications and elsewhere, are\navailable on GitHub: https://github.com/viscom-ulm/LLMMaps\n","authors":["Patrik Puchert","Poonam Poonam","Christian van Onzenoodt","Timo Ropinski"],"pdf_url":"https://arxiv.org/pdf/2304.00457v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08237v1","updated":"2023-10-12T11:33:15Z","published":"2023-10-12T11:33:15Z","title":"Towards a Unified Analysis of Kernel-based Methods Under Covariate Shift","summary":" Covariate shift occurs frequently in practice, where the input distributions\nof the source and target data are substantially different. Despite its\npractical importance in various learning problems, most existing methods\nfocus only on specific learning tasks and are not well validated\ntheoretically or numerically. To tackle this problem, we propose a unified\nanalysis of general nonparametric methods in a reproducing kernel Hilbert space\n(RKHS) under covariate shift. Our theoretical results are established for a\ngeneral loss belonging to a rich loss function family, which includes many\ncommonly used methods as special cases, such as mean regression, quantile\nregression, likelihood-based classification, and margin-based classification.\nTwo types of covariate shift problems are the focus of this paper, and sharp\nconvergence rates are established for a general loss function to provide a\nunified theoretical analysis, which concurs with the optimal results in the\nliterature where the squared loss is used. Extensive numerical studies on\nsynthetic and real examples confirm our theoretical findings and further\nillustrate the effectiveness of our proposed method.\n","authors":["Xingdong Feng","Xin He","Caixing Wang","Chao Wang","Jingnan Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08237v1.pdf","comment":"Poster to appear in Thirty-seventh Conference on Neural Information\n Processing Systems"},{"id":"http://arxiv.org/abs/2310.08235v1","updated":"2023-10-12T11:31:01Z","published":"2023-10-12T11:31:01Z","title":"GROOT: Learning to Follow Instructions by Watching Gameplay Videos","summary":" We study the problem of building a controller that can follow open-ended\ninstructions in open-world environments. We propose to follow reference videos\nas instructions, which offer expressive goal specifications while eliminating\nthe need for expensive text-gameplay annotations. A new learning framework is\nderived to allow learning such instruction-following controllers from gameplay\nvideos while producing a video instruction encoder that induces a structured\ngoal space. We implement our agent GROOT in a simple yet effective\nencoder-decoder architecture based on causal transformers. We evaluate GROOT\nagainst open-world counterparts and human players on a proposed Minecraft\nSkillForge benchmark. The Elo ratings clearly show that GROOT is closing the\nhuman-machine gap as well as exhibiting a 70% winning rate over the best\ngeneralist agent baseline. Qualitative analysis of the induced goal space\nfurther demonstrates some interesting emergent properties, including goal\ncomposition and complex gameplay behavior synthesis. 
Code and video can be\nfound on the website https://craftjarvis-groot.github.io.\n","authors":["Shaofei Cai","Bowei Zhang","Zihao Wang","Xiaojian Ma","Anji Liu","Yitao Liang"],"pdf_url":"https://arxiv.org/pdf/2310.08235v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2107.14432v4","updated":"2023-10-12T11:25:26Z","published":"2021-07-30T05:33:43Z","title":"Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR\n Prediction","summary":" We develop a novel framework that adds the regularizers of the sparse group\nlasso to a family of adaptive optimizers in deep learning, such as Momentum,\nAdagrad, Adam, AMSGrad, AdaHessian, and create a new class of optimizers, which\nare named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad and Group\nAdaHessian, etc., accordingly. We establish theoretically proven convergence\nguarantees in the stochastic convex settings, based on primal-dual methods. We\nevaluate the regularized effect of our new optimizers on three large-scale\nreal-world ad click datasets with state-of-the-art deep learning models. The\nexperimental results reveal that compared with the original optimizers with the\npost-processing procedure which uses the magnitude pruning method, the\nperformance of the models can be significantly improved on the same sparsity\nlevel. Furthermore, in comparison to the cases without magnitude pruning, our\nmethods can achieve extremely high sparsity with significantly better or highly\ncompetitive performance. The code is available at\nhttps://github.com/intelligent-machine-learning/dlrover/blob/master/tfplus.\n","authors":["Yun Yue","Yongchao Liu","Suo Tong","Minghao Li","Zhen Zhang","Chunyang Wen","Huanjun Bao","Lihong Gu","Jinjie Gu","Yixiang Mu"],"pdf_url":"https://arxiv.org/pdf/2107.14432v4.pdf","comment":"24 pages. Published as a conference paper at ECML PKDD 2021. This\n version includes Appendix which was not included in the published version\n because of page limit"},{"id":"http://arxiv.org/abs/2310.08224v1","updated":"2023-10-12T11:16:57Z","published":"2023-10-12T11:16:57Z","title":"Emergence of Latent Binary Encoding in Deep Neural Network Classifiers","summary":" We observe the emergence of binary encoding within the latent space of\ndeep-neural-network classifiers. Such binary encoding is induced by introducing\na linear penultimate layer, which is equipped during training with a loss\nfunction that grows as $\\exp(\\vec{x}^2)$, where $\\vec{x}$ are the coordinates\nin the latent space. The phenomenon we describe represents a specific instance\nof a well-documented occurrence known as \\textit{neural collapse}, which arises\nin the terminal phase of training and entails the collapse of latent class\nmeans to the vertices of a simplex equiangular tight frame (ETF). We show that\nbinary encoding accelerates convergence toward the simplex ETF and enhances\nclassification accuracy.\n","authors":["Luigi Sbailò","Luca Ghiringhelli"],"pdf_url":"https://arxiv.org/pdf/2310.08224v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08221v1","updated":"2023-10-12T11:11:54Z","published":"2023-10-12T11:11:54Z","title":"SimCKP: Simple Contrastive Learning of Keyphrase Representations","summary":" Keyphrase generation (KG) aims to generate a set of summarizing words or\nphrases given a source document, while keyphrase extraction (KE) aims to\nidentify them from the text. 
Because the search space is much smaller in KE, it\nis often combined with KG to predict keyphrases that may or may not exist in\nthe corresponding document. However, current unified approaches adopt sequence\nlabeling and maximization-based generation that primarily operate at a token\nlevel, falling short in observing and scoring keyphrases as a whole. In this\nwork, we propose SimCKP, a simple contrastive learning framework that consists\nof two stages: 1) An extractor-generator that extracts keyphrases by learning\ncontext-aware phrase-level representations in a contrastive manner while also\ngenerating keyphrases that do not appear in the document; 2) A reranker that\nadapts scores for each generated phrase by likewise aligning their\nrepresentations with the corresponding document. Experimental results on\nmultiple benchmark datasets demonstrate the effectiveness of our proposed\napproach, which outperforms the state-of-the-art models by a significant\nmargin.\n","authors":["Minseok Choi","Chaeheon Gwak","Seho Kim","Si Hyeong Kim","Jaegul Choo"],"pdf_url":"https://arxiv.org/pdf/2310.08221v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2309.00848v2","updated":"2023-10-12T11:11:23Z","published":"2023-09-02T07:17:43Z","title":"Bengali Document Layout Analysis -- A YOLOV8 Based Ensembling Approach","summary":" This paper focuses on enhancing Bengali Document Layout Analysis (DLA) using\nthe YOLOv8 model and innovative post-processing techniques. We tackle\nchallenges unique to the complex Bengali script by employing data augmentation\nfor model robustness. After meticulous validation set evaluation, we fine-tune\nour approach on the complete dataset, leading to a two-stage prediction\nstrategy for accurate element segmentation. Our ensemble model, combined with\npost-processing, outperforms individual base architectures, addressing issues\nidentified in the BaDLAD dataset. By leveraging this approach, we aim to\nadvance Bengali document analysis, contributing to improved OCR and document\ncomprehension and BaDLAD serves as a foundational resource for this endeavor,\naiding future research in the field. Furthermore, our experiments provided key\ninsights to incorporate new strategies into the established solution.\n","authors":["Nazmus Sakib Ahmed","Saad Sakib Noor","Ashraful Islam Shanto Sikder","Abhijit Paul"],"pdf_url":"https://arxiv.org/pdf/2309.00848v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08217v1","updated":"2023-10-12T11:05:34Z","published":"2023-10-12T11:05:34Z","title":"TriRE: A Multi-Mechanism Learning Paradigm for Continual Knowledge\n Retention and Promotion","summary":" Continual learning (CL) has remained a persistent challenge for deep neural\nnetworks due to catastrophic forgetting (CF) of previously learned tasks.\nSeveral techniques such as weight regularization, experience rehearsal, and\nparameter isolation have been proposed to alleviate CF. Despite their relative\nsuccess, these research directions have predominantly remained orthogonal and\nsuffer from several shortcomings, while missing out on the advantages of\ncompeting strategies. On the contrary, the brain continually learns,\naccommodates, and transfers knowledge across tasks by simultaneously leveraging\nseveral neurophysiological processes, including neurogenesis, active\nforgetting, neuromodulation, metaplasticity, experience rehearsal, and\ncontext-dependent gating, rarely resulting in CF. 
Inspired by how the brain\nexploits multiple mechanisms concurrently, we propose TriRE, a novel CL\nparadigm that encompasses retaining the most prominent neurons for each task,\nrevising and solidifying the extracted knowledge of current and past tasks, and\nactively promoting less active neurons for subsequent tasks through rewinding\nand relearning. Across CL settings, TriRE significantly reduces task\ninterference and surpasses different CL approaches considered in isolation.\n","authors":["Preetha Vijayan","Prashant Bhat","Elahe Arani","Bahram Zonooz"],"pdf_url":"https://arxiv.org/pdf/2310.08217v1.pdf","comment":"Accepted at 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.08215v1","updated":"2023-10-12T11:04:17Z","published":"2023-10-12T11:04:17Z","title":"Trustworthy Machine Learning","summary":" As machine learning technology gets applied to actual products and solutions,\nnew challenges have emerged. Models unexpectedly fail to generalize to small\nchanges in the distribution, tend to be confident on novel data they have never\nseen, or cannot communicate the rationale behind their decisions effectively\nwith the end users. Collectively, we face a trustworthiness issue with the\ncurrent machine learning technology. This textbook on Trustworthy Machine\nLearning (TML) covers a theoretical and technical background of four key topics\nin TML: Out-of-Distribution Generalization, Explainability, Uncertainty\nQuantification, and Evaluation of Trustworthiness. We discuss important\nclassical and contemporary research papers of the aforementioned fields and\nuncover and connect their underlying intuitions. The book evolved from the\nhomonymous course at the University of T\\\"ubingen, first offered in the Winter\nSemester of 2022/23. It is meant to be a stand-alone product accompanied by\ncode snippets and various pointers to further sources on topics of TML. The\ndedicated website of the book is https://trustworthyml.io/.\n","authors":["Bálint Mucsányi","Michael Kirchhof","Elisa Nguyen","Alexander Rubinstein","Seong Joon Oh"],"pdf_url":"https://arxiv.org/pdf/2310.08215v1.pdf","comment":"373 pages, textbook at the University of T\\\"ubingen"},{"id":"http://arxiv.org/abs/2310.08209v1","updated":"2023-10-12T10:56:25Z","published":"2023-10-12T10:56:25Z","title":"Conformal inference for regression on Riemannian Manifolds","summary":" Regression on manifolds, and, more broadly, statistics on manifolds, has\ngarnered significant importance in recent years due to the vast number of\napplications for this type of data. Circular data is a classic example, but so\nis data in the space of covariance matrices, data on the Grassmannian manifold\nobtained as a result of principal component analysis, among many others. In\nthis work we investigate prediction sets for regression scenarios when the\nresponse variable, denoted by $Y$, resides in a manifold, and the covariable,\ndenoted by X, lies in Euclidean space. This extends the concepts delineated in\n[Lei and Wasserman, 2014] to this novel context. Aligning with traditional\nprinciples in conformal inference, these prediction sets are distribution-free,\nindicating that no specific assumptions are imposed on the joint distribution\nof $(X, Y)$, and they maintain a non-parametric character. We prove the\nasymptotic almost sure convergence of the empirical version of these regions on\nthe manifold to their population counterparts. 
The efficiency of this method is\nshown through a comprehensive simulation study and an analysis involving\nreal-world data.\n","authors":["Alejandro Cholaquidis","Fabrice Gamboa","Leonardo Moreno"],"pdf_url":"https://arxiv.org/pdf/2310.08209v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08204v1","updated":"2023-10-12T10:50:21Z","published":"2023-10-12T10:50:21Z","title":"Lifelong Audio-video Masked Autoencoder with Forget-robust Localized\n Alignments","summary":" We present a lifelong audio-video masked autoencoder that continually learns\nthe multimodal representations from a video stream containing audio-video\npairs, while its distribution continually shifts over time. Specifically, we\npropose two novel ideas to tackle the problem: (1) Localized Alignment: We\nintroduce a small trainable multimodal encoder that predicts the audio and\nvideo tokens that are well-aligned with each other. This allows the model to\nlearn only the highly correlated audiovisual patches with accurate multimodal\nrelationships. (2) Forget-robust multimodal patch selection: We compare the\nrelative importance of each audio-video patch between the current and past data\npair to mitigate unintended drift of the previously learned audio-video\nrepresentations. Our proposed method, FLAVA (Forget-robust Localized\nAudio-Video Alignment), therefore, captures the complex relationships between\nthe audio and video modalities during training on a sequence of pre-training\ntasks while alleviating the forgetting of learned audiovisual correlations. Our\nexperiments validate that FLAVA outperforms the state-of-the-art continual\nlearning methods on several benchmark datasets under continual audio-video\nrepresentation learning scenarios.\n","authors":["Jaewoo Lee","Jaehong Yoon","Wonjae Kim","Yunji Kim","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2310.08204v1.pdf","comment":"Preprint, project page: https://g-jwlee.github.io/FLAVA/"},{"id":"http://arxiv.org/abs/2310.08198v1","updated":"2023-10-12T10:44:47Z","published":"2023-10-12T10:44:47Z","title":"Beyond Traditional DoE: Deep Reinforcement Learning for Optimizing\n Experiments in Model Identification of Battery Dynamics","summary":" Model identification of battery dynamics is a central problem in energy\nresearch; many energy management systems and design processes rely on accurate\nbattery models for efficiency optimization. The standard methodology for\nbattery modelling is traditional design of experiments (DoE), where the battery\ndynamics are excited with many different current profiles and the measured\noutputs are used to estimate the system dynamics. However, although it is\npossible to obtain useful models with the traditional approach, the process is\ntime consuming and expensive because of the need to sweep many different\ncurrent-profile configurations. In the present work, a novel DoE approach is\ndeveloped based on deep reinforcement learning, which alters the configuration\nof the experiments on the fly based on the statistics of past experiments.\nInstead of sticking to a library of predefined current profiles, the proposed\napproach modifies the current profiles dynamically by updating the output space\ncovered by past measurements, hence only the current profiles that are\ninformative for future experiments are applied. 
Simulations and real\nexperiments are used to show that the proposed approach gives models that are\nas accurate as those obtained with traditional DoE but by using 85\\% less\nresources.\n","authors":["Gokhan Budan","Francesca Damiani","Can Kurtulus","N. Kemal Ure"],"pdf_url":"https://arxiv.org/pdf/2310.08198v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09033v2","updated":"2023-10-12T10:34:03Z","published":"2023-03-16T02:07:29Z","title":"Only Pay for What Is Uncertain: Variance-Adaptive Thompson Sampling","summary":" Most bandit algorithms assume that the reward variances or their upper bounds\nare known, and that they are the same for all arms. This naturally leads to\nsuboptimal performance and higher regret due to variance overestimation. On the\nother hand, underestimated reward variances may lead to linear regret due to\ncommitting early to a suboptimal arm. This motivated prior works on\nvariance-adaptive frequentist algorithms, which have strong instance-dependent\nregret bounds but cannot incorporate prior knowledge on reward variances. We\nlay foundations for the Bayesian setting, which incorporates prior knowledge.\nThis results in lower regret in practice, due to using the prior in the\nalgorithm design, and also improved regret guarantees. Specifically, we study\nGaussian bandits with {unknown heterogeneous reward variances}, and develop a\nThompson sampling algorithm with prior-dependent Bayes regret bounds. We\nachieve lower regret with lower reward variances and more informative priors on\nthem, which is precisely why we pay only for what is uncertain. This is the\nfirst result of its kind. Finally, we corroborate our theory with extensive\nexperiments, which show the superiority of our variance-adaptive Bayesian\nalgorithm over prior frequentist approaches. We also show that our approach is\nrobust to model misspecification and can be applied with estimated priors.\n","authors":["Aadirupa Saha","Branislav Kveton"],"pdf_url":"https://arxiv.org/pdf/2303.09033v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07402v2","updated":"2023-10-12T10:30:35Z","published":"2023-10-11T11:38:18Z","title":"NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time Series\n Pretraining","summary":" Recent research on time-series self-supervised models shows great promise in\nlearning semantic representations. However, it has been limited to small-scale\ndatasets, e.g., thousands of temporal sequences. In this work, we make key\ntechnical contributions that are tailored to the numerical properties of\ntime-series data and allow the model to scale to large datasets, e.g., millions\nof temporal sequences. We adopt the Transformer architecture by first\npartitioning the input into non-overlapping windows. Each window is then\ncharacterized by its normalized shape and two scalar values denoting the mean\nand standard deviation within each window. To embed scalar values that may\npossess arbitrary numerical scales to high-dimensional vectors, we propose a\nnumerically multi-scaled embedding module enumerating all possible scales for\nthe scalar values. The model undergoes pretraining using the proposed\nnumerically multi-scaled embedding with a simple contrastive objective on a\nlarge-scale dataset containing over a million sequences. We study its transfer\nperformance on a number of univariate and multivariate classification\nbenchmarks. 
Our method exhibits remarkable improvement against previous\nrepresentation learning approaches and establishes the new state of the art,\neven compared with domain-specific non-learning-based methods.\n","authors":["Chenguo Lin","Xumeng Wen","Wei Cao","Congrui Huang","Jiang Bian","Stephen Lin","Zhirong Wu"],"pdf_url":"https://arxiv.org/pdf/2310.07402v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08184v1","updated":"2023-10-12T10:20:36Z","published":"2023-10-12T10:20:36Z","title":"Learn From Model Beyond Fine-Tuning: A Survey","summary":" Foundation models (FM) have demonstrated remarkable performance across a wide\nrange of tasks (especially in the fields of natural language processing and\ncomputer vision), primarily attributed to their ability to comprehend\ninstructions and access extensive, high-quality data. This not only showcases\ntheir current effectiveness but also sets a promising trajectory towards the\ndevelopment of artificial general intelligence. Unfortunately, due to multiple\nconstraints, the raw data of the model used for large model training are often\ninaccessible, so the use of end-to-end models for downstream tasks has become a\nnew research trend, which we call Learn From Model (LFM) in this article. LFM\nfocuses on the research, modification, and design of FM based on the model\ninterface, so as to better understand the model structure and weights (in a\nblack box environment), and to generalize the model to downstream tasks. The\nstudy of LFM techniques can be broadly categorized into five major areas: model\ntuning, model distillation, model reuse, meta learning and model editing. Each\ncategory encompasses a repertoire of methods and strategies that aim to enhance\nthe capabilities and performance of FM. This paper gives a comprehensive review\nof the current methods based on FM from the perspective of LFM, in order to\nhelp readers better understand the current research status and ideas. To\nconclude, we summarize the survey by highlighting several critical areas for\nfuture exploration and addressing open issues that require further attention\nfrom the research community. The relevant papers we investigated in this\narticle can be accessed at\n.\n","authors":["Hongling Zheng","Li Shen","Anke Tang","Yong Luo","Han Hu","Bo Du","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2310.08184v1.pdf","comment":"20 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.08182v1","updated":"2023-10-12T10:17:40Z","published":"2023-10-12T10:17:40Z","title":"XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness\n Evaluation","summary":" The lack of standardized robustness metrics and the widespread reliance on\nnumerous unrelated benchmark datasets for testing have created a gap between\nacademically validated robust models and their often problematic practical\nadoption. To address this, we introduce XIMAGENET-12, an explainable benchmark\ndataset with over 200K images and 15,600 manual semantic annotations. Covering\n12 categories from ImageNet to represent objects commonly encountered in\npractical life and simulating six diverse scenarios, including overexposure,\nblurring, color changing, etc., we further propose a novel robustness criterion\nthat extends beyond model generation ability assessment. This benchmark\ndataset, along with related code, is available at\nhttps://sites.google.com/view/ximagenet-12/home. 
Researchers and practitioners\ncan leverage this resource to evaluate the robustness of their visual models\nunder challenging conditions and ultimately better meet the demands of\npractical computer vision systems.\n","authors":["Qiang Li","Dan Zhang","Shengzhao Lei","Xun Zhao","Shuyan Li","Porawit Kamnoedboon","WeiWei Li"],"pdf_url":"https://arxiv.org/pdf/2310.08182v1.pdf","comment":"UnderSubmission"},{"id":"http://arxiv.org/abs/2310.08177v1","updated":"2023-10-12T10:03:25Z","published":"2023-10-12T10:03:25Z","title":"Improving Fast Minimum-Norm Attacks with Hyperparameter Optimization","summary":" Evaluating the adversarial robustness of machine learning models using\ngradient-based attacks is challenging. In this work, we show that\nhyperparameter optimization can improve fast minimum-norm attacks by automating\nthe selection of the loss function, the optimizer and the step-size scheduler,\nalong with the corresponding hyperparameters. Our extensive evaluation\ninvolving several robust models demonstrates the improved efficacy of fast\nminimum-norm attacks when coupled with hyperparameter optimization. We release\nour open-source code at https://github.com/pralab/HO-FMN.\n","authors":["Giuseppe Floris","Raffaele Mura","Luca Scionis","Giorgio Piras","Maura Pintor","Ambra Demontis","Battista Biggio"],"pdf_url":"https://arxiv.org/pdf/2310.08177v1.pdf","comment":"Accepted at ESANN23"},{"id":"http://arxiv.org/abs/2310.08176v1","updated":"2023-10-12T10:01:39Z","published":"2023-10-12T10:01:39Z","title":"Infinite Width Graph Neural Networks for Node Regression/ Classification","summary":" This work analyzes Graph Neural Networks, a generalization of Fully-Connected\nDeep Neural Nets on Graph structured data, when their width, that is, the number\nof nodes in each fully-connected layer, increases to infinity. Infinite Width\nNeural Networks connect Deep Learning to Gaussian Processes and Kernels,\nboth Machine Learning Frameworks with long traditions and extensive theoretical\nfoundations. Gaussian Processes and Kernels have far fewer hyperparameters than\nNeural Networks and can be used for uncertainty estimation, making them more\nuser-friendly for applications. This work extends the increasing amount of\nresearch connecting Gaussian Processes and Kernels to Neural Networks. The\nKernel and Gaussian Process closed forms are derived for a variety of\narchitectures, namely the standard Graph Neural Network, the Graph Neural\nNetwork with Skip-Concatenate Connections and the Graph Attention Neural\nNetwork. All architectures are evaluated on a variety of datasets on the task\nof transductive Node Regression and Classification. Additionally, a Spectral\nSparsification method known as Effective Resistance is used to improve runtime\nand memory requirements. Extending the setting to inductive graph learning\ntasks (Graph Regression/ Classification) is straightforward and is briefly\ndiscussed in Section 3.5.\n","authors":["Yunus Cobanoglu"],"pdf_url":"https://arxiv.org/pdf/2310.08176v1.pdf","comment":"50 Pages, 2 Figures (with subfigures), multiple tables"},{"id":"http://arxiv.org/abs/2308.08469v3","updated":"2023-10-12T09:58:03Z","published":"2023-08-16T16:19:50Z","title":"LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with\n Pre-Trained LLMs","summary":" In this work, we leverage pre-trained Large Language Models (LLMs) to enhance\ntime-series forecasting. 
Mirroring the growing interest in unifying models for\nNatural Language Processing and Computer Vision, we envision creating an\nanalogous model for long-term time-series forecasting. Due to limited\nlarge-scale time-series data for building robust foundation models, our\napproach LLM4TS focuses on leveraging the strengths of pre-trained LLMs. By\ncombining time-series patching with temporal encoding, we have enhanced the\ncapability of LLMs to handle time-series data effectively. Inspired by the\nsupervised fine-tuning in chatbot domains, we prioritize a two-stage\nfine-tuning process: first conducting supervised fine-tuning to orient the LLM\ntowards time-series data, followed by task-specific downstream fine-tuning.\nFurthermore, to unlock the flexibility of pre-trained LLMs without extensive\nparameter adjustments, we adopt several Parameter-Efficient Fine-Tuning (PEFT)\ntechniques. Drawing on these innovations, LLM4TS has yielded state-of-the-art\nresults in long-term forecasting. Our model has also shown exceptional\ncapabilities as both a robust representation learner and an effective few-shot\nlearner, thanks to the knowledge transferred from the pre-trained LLM.\n","authors":["Ching Chang","Wen-Chih Peng","Tien-Fu Chen"],"pdf_url":"https://arxiv.org/pdf/2308.08469v3.pdf","comment":"This paper is currently under review. The code will be made available\n upon acceptance"},{"id":"http://arxiv.org/abs/2310.08165v1","updated":"2023-10-12T09:37:56Z","published":"2023-10-12T09:37:56Z","title":"COVID-19 Detection Using Swin Transformer Approach from Computed\n Tomography Images","summary":" The accurate and efficient diagnosis of COVID-19 is of paramount importance,\nparticularly in the context of large-scale medical imaging datasets. In this\npreprint paper, we propose a novel approach for COVID-19 diagnosis using CT\nimages that leverages the power of Swin Transformer models, state-of-the-art\nsolutions in computer vision tasks. Our method includes a systematic approach\nfor patient-level predictions, where individual CT slices are classified as\nCOVID-19 or non-COVID, and the patient's overall diagnosis is determined\nthrough majority voting. The application of the Swin Transformer in this\ncontext results in patient-level predictions that demonstrate exceptional\ndiagnostic accuracy. In terms of evaluation metrics, our approach consistently\noutperforms the baseline, as well as numerous competing methods, showcasing its\neffectiveness in COVID-19 diagnosis. The macro F1 score achieved by our model\nexceeds the baseline and offers a robust solution for accurate diagnosis.\n","authors":["Kenan Morani"],"pdf_url":"https://arxiv.org/pdf/2310.08165v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08164v1","updated":"2023-10-12T09:36:03Z","published":"2023-10-12T09:36:03Z","title":"Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse\n Autoencoders","summary":" Large language models (LLMs) aligned to human preferences via reinforcement\nlearning from human feedback (RLHF) underpin many commercial applications.\nHowever, how RLHF impacts LLM internals remains opaque. We propose a novel\nmethod to interpret learned reward functions in RLHF-tuned LLMs using sparse\nautoencoders. Our approach trains autoencoder sets on activations from a base\nLLM and its RLHF-tuned version. 
By comparing autoencoder hidden spaces, we\nidentify unique features that reflect the accuracy of the learned reward model.\nTo quantify this, we construct a scenario where the tuned LLM learns\ntoken-reward mappings to maximize reward. This is the first application of\nsparse autoencoders for interpreting learned rewards and broadly inspecting\nreward learning in LLMs. Our method provides an abstract approximation of\nreward integrity. This presents a promising technique for ensuring alignment\nbetween specified objectives and model behaviors.\n","authors":["Luke Marks","Amir Abdullah","Luna Mendez","Rauno Arike","Philip Torr","Fazl Barez"],"pdf_url":"https://arxiv.org/pdf/2310.08164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.04234v2","updated":"2023-10-12T09:20:00Z","published":"2023-04-09T13:20:19Z","title":"Variational operator learning: A unified paradigm marrying training\n neural operators and solving partial differential equations","summary":" Neural operators as novel neural architectures for fast approximating\nsolution operators of partial differential equations (PDEs), have shown\nconsiderable promise for future scientific computing. However, the mainstream\nof training neural operators is still data-driven, which needs an expensive\nground-truth dataset from various sources (e.g., solving PDEs' samples with the\nconventional solvers, real-world experiments) in addition to training stage\ncosts. From a computational perspective, marrying operator learning and\nspecific domain knowledge to solve PDEs is an essential step in reducing\ndataset costs and label-free learning. We propose a novel paradigm that\nprovides a unified framework of training neural operators and solving PDEs with\nthe variational form, which we refer to as the variational operator learning\n(VOL). Ritz and Galerkin approach with finite element discretization are\ndeveloped for VOL to achieve matrix-free approximation of system functional and\nresidual, then direct minimization and iterative update are proposed as two\noptimization strategies for VOL. Various types of experiments based on\nreasonable benchmarks about variable heat source, Darcy flow, and variable\nstiffness elasticity are conducted to demonstrate the effectiveness of VOL.\nWith a label-free training set and a 5-label-only shift set, VOL learns\nsolution operators with its test errors decreasing in a power law with respect\nto the amount of unlabeled data. To the best of the authors' knowledge, this is\nthe first study that integrates the perspectives of the weak form and efficient\niterative methods for solving sparse linear systems into the end-to-end\noperator learning task.\n","authors":["Tengfei Xu","Dachuan Liu","Peng Hao","Bo Wang"],"pdf_url":"https://arxiv.org/pdf/2304.04234v2.pdf","comment":"35 pages, 6 figures with 5 extended figures"},{"id":"http://arxiv.org/abs/2305.14133v2","updated":"2023-10-12T09:18:09Z","published":"2023-05-23T14:56:19Z","title":"Conditional Mutual Information for Disentangled Representations in\n Reinforcement Learning","summary":" Reinforcement Learning (RL) environments can produce training data with\nspurious correlations between features due to the amount of training data or\nits limited feature coverage. This can lead to RL agents encoding these\nmisleading correlations in their latent representation, preventing the agent\nfrom generalising if the correlation changes within the environment or when\ndeployed in the real world. 
Disentangled representations can improve\nrobustness, but existing disentanglement techniques that minimise mutual\ninformation between features require independent features, and thus cannot\ndisentangle correlated features. We propose an auxiliary task for RL algorithms\nthat learns a disentangled representation of high-dimensional observations with\ncorrelated features by minimising the conditional mutual information between\nfeatures in the representation. We demonstrate experimentally, using continuous\ncontrol tasks, that our approach improves generalisation under correlation\nshifts, as well as improving the training performance of RL algorithms in the\npresence of correlated features.\n","authors":["Mhairi Dunion","Trevor McInroe","Kevin Sebastian Luck","Josiah P. Hanna","Stefano V. Albrecht"],"pdf_url":"https://arxiv.org/pdf/2305.14133v2.pdf","comment":"Conference on Neural Information Processing Systems (NeurIPS), 2023"},{"id":"http://arxiv.org/abs/2310.08150v1","updated":"2023-10-12T09:17:46Z","published":"2023-10-12T09:17:46Z","title":"On Extreme Value Asymptotics of Projected Sample Covariances in High\n Dimensions with Applications in Finance and Convolutional Networks","summary":" Maximum-type statistics of certain functions of the sample covariance matrix\nof high-dimensional vector time series are studied to statistically confirm or\nreject the null hypothesis that a data set has been collected under normal\nconditions. The approach generalizes the case of the maximal deviation of the\nsample autocovariance function from its assumed values. Within a linear time\nseries framework it is shown that Gumbel-type extreme value asymptotics holds\ntrue. As applications we discuss long-only minimal-variance portfolio\noptimization and subportfolio analysis with respect to idiosyncratic risks, ETF\nindex tracking by sparse tracking portfolios, convolutional deep learners for\nimage analysis and the analysis of array-of-sensors data.\n","authors":["Ansgar Steland"],"pdf_url":"https://arxiv.org/pdf/2310.08150v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08148v1","updated":"2023-10-12T09:12:50Z","published":"2023-10-12T09:12:50Z","title":"Open-Set Knowledge-Based Visual Question Answering with Inference Paths","summary":" Given an image and an associated textual question, the purpose of\nKnowledge-Based Visual Question Answering (KB-VQA) is to provide a correct\nanswer to the question with the aid of external knowledge bases. Prior KB-VQA\nmodels are usually formulated as a retriever-classifier framework, where a\npre-trained retriever extracts textual or visual information from knowledge\ngraphs and then makes a prediction among the candidates. Despite promising\nprogress, there are two drawbacks to existing models. Firstly, modeling\nquestion-answering as multi-class classification limits the answer space to a\npreset corpus and lacks the capacity for flexible reasoning. Secondly, the\nclassifier merely considers \"what is the answer\" without \"how to get the\nanswer\", and thus cannot ground the answer in explicit reasoning paths. In this\npaper, we confront the challenge of \\emph{explainable open-set} KB-VQA, where\nthe system is required to answer questions with entities in the wild and retain an\nexplainable reasoning path. To resolve the aforementioned issues, we propose a\nnew retriever-ranker paradigm of KB-VQA, Graph pATH rankER (GATHER for\nbrevity). 
Specifically, it contains graph construction, pruning, and path-level\nranking, which not only retrieves accurate answers but also provides inference\npaths that explain the reasoning process. To comprehensively evaluate our\nmodel, we reformulate the benchmark dataset OK-VQA with manually corrected\nentity-level annotations and release it as ConceptVQA. Extensive experiments on\nreal-world questions demonstrate that our framework is not only able to perform\nopen-set question answering across the whole knowledge base but also provides\nexplicit reasoning paths.\n","authors":["Jingru Gan","Xinzhe Han","Shuhui Wang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08148v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16584v2","updated":"2023-10-12T09:03:48Z","published":"2023-09-28T16:44:18Z","title":"A Design Toolbox for the Development of Collaborative Distributed\n Machine Learning Systems","summary":" To leverage data for the sufficient training of machine learning (ML) models\nfrom multiple parties in a confidentiality-preserving way, various\ncollaborative distributed ML (CDML) system designs have been developed, for\nexample, to perform assisted learning, federated learning, and split learning.\nCDML system designs show different traits, including high agent autonomy, ML\nmodel confidentiality, and fault tolerance. Facing a wide variety of CDML\nsystem designs with different traits, it is difficult for developers to design\nCDML systems with traits that match use case requirements in a targeted way.\nHowever, inappropriate CDML system designs may result in CDML systems failing\ntheir envisioned purposes. We developed a CDML design toolbox that can guide\nthe development of CDML systems. Based on the CDML design toolbox, we present\nCDML system archetypes with distinct key traits that can support the design of\nCDML systems to meet use case requirements.\n","authors":["David Jin","Niclas Kannengießer","Sascha Rank","Ali Sunyaev"],"pdf_url":"https://arxiv.org/pdf/2309.16584v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08138v1","updated":"2023-10-12T08:52:36Z","published":"2023-10-12T08:52:36Z","title":"Multi-Scale Spatial-Temporal Recurrent Networks for Traffic Flow\n Prediction","summary":" Traffic flow prediction is one of the most fundamental tasks of intelligent\ntransportation systems. The complex and dynamic spatial-temporal dependencies\nmake traffic flow prediction quite challenging. Although existing\nspatial-temporal graph neural networks are prominent, they often encounter\nchallenges such as (1) relying on a fixed graph, which limits the predictive\nperformance of the model, (2) insufficiently capturing complex spatial-temporal\ndependencies simultaneously, and (3) lacking attention to spatial-temporal\ninformation at different time lengths. In this paper, we propose a Multi-Scale\nSpatial-Temporal Recurrent Network for traffic flow prediction, namely MSSTRN,\nwhich consists of two different recurrent neural networks: the single-step gated\nrecurrent unit and the multi-step gated recurrent unit, to fully capture the\ncomplex spatial-temporal information in the traffic data under different time\nwindows. Moreover, we propose a spatial-temporal synchronous attention\nmechanism that integrates adaptive position graph convolutions into the\nself-attention mechanism to achieve synchronous capture of spatial-temporal\ndependencies. 
We conducted extensive experiments on four real traffic datasets\nand demonstrated that our model achieves the best prediction accuracy with\nnon-trivial margins compared to all the twenty baseline methods.\n","authors":["Haiyang Liu","Chunjiang Zhu","Detian Zhang","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2310.08138v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08137v1","updated":"2023-10-12T08:51:59Z","published":"2023-10-12T08:51:59Z","title":"Counterfactual Explanations for Time Series Forecasting","summary":" Among recent developments in time series forecasting methods, deep\nforecasting models have gained popularity as they can utilize hidden feature\npatterns in time series to improve forecasting performance. Nevertheless, the\nmajority of current deep forecasting models are opaque, hence making it\nchallenging to interpret the results. While counterfactual explanations have\nbeen extensively employed as a post-hoc approach for explaining classification\nmodels, their application to forecasting models still remains underexplored. In\nthis paper, we formulate the novel problem of counterfactual generation for\ntime series forecasting, and propose an algorithm, called ForecastCF, that\nsolves the problem by applying gradient-based perturbations to the original\ntime series. ForecastCF guides the perturbations by applying constraints to the\nforecasted values to obtain desired prediction outcomes. We experimentally\nevaluate ForecastCF using four state-of-the-art deep model architectures and\ncompare to two baselines. Our results show that ForecastCF outperforms the\nbaseline in terms of counterfactual validity and data manifold closeness.\nOverall, our findings suggest that ForecastCF can generate meaningful and\nrelevant counterfactual explanations for various forecasting tasks.\n","authors":["Zhendong Wang","Ioanna Miliou","Isak Samsten","Panagiotis Papapetrou"],"pdf_url":"https://arxiv.org/pdf/2310.08137v1.pdf","comment":"10 pages, 6 figures. Accepted by ICDM 2023"},{"id":"http://arxiv.org/abs/2310.00177v3","updated":"2023-10-12T08:51:28Z","published":"2023-09-29T22:49:47Z","title":"A Neural-preconditioned Poisson Solver for Mixed Dirichlet and Neumann\n Boundary Conditions","summary":" We introduce a neural-preconditioned iterative solver for Poisson equations\nwith mixed boundary conditions. The Poisson equation is ubiquitous in\nscientific computing: it governs a wide array of physical phenomena, arises as\na subproblem in many numerical algorithms, and serves as a model problem for\nthe broader class of elliptic PDEs. The most popular Poisson discretizations\nyield large sparse linear systems. At high resolution, and for\nperformance-critical applications, iterative solvers can be advantageous for\nthese -- but only when paired with powerful preconditioners. The core of our\nsolver is a neural network trained to approximate the inverse of a discrete\nstructured-grid Laplace operator for a domain of arbitrary shape and with mixed\nboundary conditions. The structure of this problem motivates a novel network\narchitecture that we demonstrate is highly effective as a preconditioner even\nfor boundary conditions outside the training set. 
We show that on challenging\ntest cases arising from an incompressible fluid simulation, our method\noutperforms state-of-the-art solvers like algebraic multigrid as well as some\nrecent neural preconditioners.\n","authors":["Kai Weixian Lan","Elias Gueidon","Ayano Kaneda","Julian Panetta","Joseph Teran"],"pdf_url":"https://arxiv.org/pdf/2310.00177v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.06118v2","updated":"2023-10-12T08:46:40Z","published":"2023-05-10T13:09:57Z","title":"NeRF2: Neural Radio-Frequency Radiance Fields","summary":" Although Maxwell discovered the physical laws of electromagnetic waves 160\nyears ago, how to precisely model the propagation of an RF signal in an\nelectrically large and complex environment remains a long-standing problem. The\ndifficulty is in the complex interactions between the RF signal and the\nobstacles (e.g., reflection, diffraction, etc.). Inspired by the great success\nof using a neural network to describe the optical field in computer vision, we\npropose a neural radio-frequency radiance field, NeRF$^\\textbf{2}$, which\nrepresents a continuous volumetric scene function that makes sense of an RF\nsignal's propagation. Particularly, after training with a few signal\nmeasurements, NeRF$^\\textbf{2}$ can tell how/what signal is received at any\nposition when it knows the position of a transmitter. As a physical-layer\nneural network, NeRF$^\\textbf{2}$ can take advantage of the learned statistic\nmodel plus the physical model of ray tracing to generate a synthetic dataset\nthat meets the training demands of application-layer artificial neural networks\n(ANNs). Thus, we can boost the performance of ANNs by the proposed\nturbo-learning, which mixes the true and synthetic datasets to intensify the\ntraining. Our experiment results show that turbo-learning can enhance\nperformance with an approximate 50% increase. We also demonstrate the power of\nNeRF$^\\textbf{2}$ in the field of indoor localization and 5G MIMO.\n","authors":["Xiaopeng Zhao","Zhenlin An","Qingrui Pan","Lei Yang"],"pdf_url":"https://arxiv.org/pdf/2305.06118v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.02286v3","updated":"2023-10-12T08:40:31Z","published":"2022-10-05T14:22:24Z","title":"Efficient probabilistic reconciliation of forecasts for real-valued and\n count time series","summary":" Hierarchical time series are common in several applied fields. The forecasts\nfor these time series are required to be coherent, that is, to satisfy the\nconstraints given by the hierarchy. The most popular technique to enforce\ncoherence is called reconciliation, which adjusts the base forecasts computed\nfor each time series. However, recent works on probabilistic reconciliation\npresent several limitations. In this paper, we propose a new approach based on\nconditioning to reconcile any type of forecast distribution. We then introduce\na new algorithm, called Bottom-Up Importance Sampling, to efficiently sample\nfrom the reconciled distribution. It can be used for any base forecast\ndistribution: discrete, continuous, or in the form of samples, providing a\nmajor speedup compared to the current methods. 
Experiments on several temporal\nhierarchies show a significant improvement over base probabilistic forecasts.\n","authors":["Lorenzo Zambon","Dario Azzimonti","Giorgio Corani"],"pdf_url":"https://arxiv.org/pdf/2210.02286v3.pdf","comment":"27 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.08122v1","updated":"2023-10-12T08:24:02Z","published":"2023-10-12T08:24:02Z","title":"Core-sets for Fair and Diverse Data Summarization","summary":" We study core-set construction algorithms for the task of Diversity\nMaximization under fairness/partition constraint. Given a set of points $P$ in\na metric space partitioned into $m$ groups, and given $k_1,\\ldots,k_m$, the\ngoal of this problem is to pick $k_i$ points from each group $i$ such that the\noverall diversity of the $k=\\sum_i k_i$ picked points is maximized. We consider\ntwo natural diversity measures: sum-of-pairwise distances and\nsum-of-nearest-neighbor distances, and show improved core-set construction\nalgorithms with respect to these measures. More precisely, we show the first\nconstant factor core-set w.r.t. sum-of-pairwise distances whose size is\nindependent of the size of the dataset and the aspect ratio. Second, we show\nthe first core-set w.r.t. the sum-of-nearest-neighbor distances. Finally, we\nrun several experiments showing the effectiveness of our core-set approach. In\nparticular, we apply constrained diversity maximization to summarize a set of\ntimed messages that takes into account the messages' recency. Specifically, the\nsummary should include more recent messages compared to older ones. This is a\nreal task in one of the largest communication platforms, affecting the\nexperience of hundreds of millions daily active users. By utilizing our\ncore-set method for this task, we achieve a 100x speed-up while losing the\ndiversity by only a few percent. Moreover, our approach allows us to improve\nthe space usage of the algorithm in the streaming setting.\n","authors":["Sepideh Mahabadi","Stojan Trajanovski"],"pdf_url":"https://arxiv.org/pdf/2310.08122v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08109v1","updated":"2023-10-12T08:10:31Z","published":"2023-10-12T08:10:31Z","title":"Overview of Physics-Informed Machine Learning Inversion of Geophysical\n Data","summary":" We review four types of algorithms for physics-informed machine learning\n(PIML) inversion of geophysical data. The unifying equation is given by the\njoint objective function $\\epsilon$:\n \\begin{eqnarray} \\epsilon^{||-PIML}&=&\\lambda_1 \\overbrace{||{\\bf\nW}^{ML}({\\bf H}_{{\\bf w}} {\\bf d}^{obs}-{\\bf m})||^2}^{NN} + \\lambda_2\n\\overbrace{{||{\\bf W}^{FWI}({\\bf L} {\\bf m}-{\\bf d}^{obs})||^2}}^{FWI} ~+\n\\nonumber\\\\ \\nonumber\\\\ && + ~~Regularizer, \\label{PIML.eq120}\n\\end{eqnarray}where the optimal model ${\\bf m}^*$ and weights $\\bf w^*$\nminimize $\\epsilon$. Here, The matrix weights are given by the boldface symbol\n$\\bf W$, and full waveform inversion (FWI) is typically computed using a\nfinite-difference solution of the wave equation, where $\\bf L$ represents the\nforward modeling operation of the wave equation as a function of the model $\\bf\nm$. Also, a fully-connected neural network (NN) is used to compute the model\n${\\bf H_w}{\\bf d}^{obs} \\approx \\bf m$ from the observed input data ${\\bf\nd}^{obs}$. 
The selection of weights $\\lambda_i$ and the NN operations determine\none of four different PIML algorithms.\n PIML offers potential advantages over standard FWI through its enhanced\nability to avoid local minima and the option to locally train the inversion\noperator, minimizing the requirement for extensive training data for global\napplicability. However, the effectiveness of PIML relies on the similarity\nbetween the test and trained data. Nevertheless, a possible strategy to\novercome this limitation involves initial pretraining of a PIML architecture\nwith data from a broader region, followed by fine-tuning for specific data-a\nmethod reminiscent of the way large language models are pretrained and adapted\nfor various tasks.\n","authors":["Gerard T. Schuster","Shihang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.08109v1.pdf","comment":"37 pages, 16 figures"},{"id":"http://arxiv.org/abs/2209.10404v3","updated":"2023-10-12T07:55:10Z","published":"2022-09-21T14:51:42Z","title":"GP-net: Flexible Viewpoint Grasp Proposal","summary":" We present the Grasp Proposal Network (GP-net), a Convolutional Neural\nNetwork model which can generate 6-DoF grasps from flexible viewpoints, e.g. as\nexperienced by mobile manipulators. To train GP-net, we synthetically generate\na dataset containing depth-images and ground-truth grasp information. In\nreal-world experiments, we use the EGAD evaluation benchmark to evaluate GP-net\nagainst two commonly used algorithms, the Volumetric Grasping Network (VGN) and\nthe Grasp Pose Detection package (GPD), on a PAL TIAGo mobile manipulator. In\ncontrast to the state-of-the-art methods in robotic grasping, GP-net can be\nused for grasping objects from flexible, unknown viewpoints without the need to\ndefine the workspace and achieves a grasp success of 54.4% compared to 51.6%\nfor VGN and 44.2% for GPD. We provide a ROS package along with our code and\npre-trained models at https://aucoroboticsmu.github.io/GP-net/.\n","authors":["Anna Konrad","John McDonald","Rudi Villing"],"pdf_url":"https://arxiv.org/pdf/2209.10404v3.pdf","comment":"Accepted to ICAR 2023"},{"id":"http://arxiv.org/abs/2309.15505v2","updated":"2023-10-12T07:55:05Z","published":"2023-09-27T09:13:40Z","title":"Finite Scalar Quantization: VQ-VAE Made Simple","summary":" We propose to replace vector quantization (VQ) in the latent representation\nof VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where\nwe project the VAE representation down to a few dimensions (typically less than\n10). Each dimension is quantized to a small set of fixed values, leading to an\n(implicit) codebook given by the product of these sets. By appropriately\nchoosing the number of dimensions and values each dimension can take, we obtain\nthe same codebook size as in VQ. On top of such discrete representations, we\ncan train the same models that have been trained on VQ-VAE representations. For\nexample, autoregressive and masked transformer models for image generation,\nmultimodal generation, and dense prediction computer vision tasks. Concretely,\nwe employ FSQ with MaskGIT for image generation, and with UViM for depth\nestimation, colorization, and panoptic segmentation. Despite the much simpler\ndesign of FSQ, we obtain competitive performance in all these tasks. We\nemphasize that FSQ does not suffer from codebook collapse and does not need the\ncomplex machinery employed in VQ (commitment losses, codebook reseeding, code\nsplitting, entropy penalties, etc.) 
to learn expressive discrete\nrepresentations.\n","authors":["Fabian Mentzer","David Minnen","Eirikur Agustsson","Michael Tschannen"],"pdf_url":"https://arxiv.org/pdf/2309.15505v2.pdf","comment":"Code:\n https://github.com/google-research/google-research/tree/master/fsq"},{"id":"http://arxiv.org/abs/2310.08100v1","updated":"2023-10-12T07:50:37Z","published":"2023-10-12T07:50:37Z","title":"Generative Intrinsic Optimization: Intrisic Control with Model Learning","summary":" Future sequence represents the outcome after executing the action into the\nenvironment. When driven by the information-theoretic concept of mutual\ninformation, it seeks maximally informative consequences. Explicit outcomes may\nvary across state, return, or trajectory serving different purposes such as\ncredit assignment or imitation learning. However, the inherent nature of\nincorporating intrinsic motivation with reward maximization is often neglected.\nIn this work, we propose a variational approach to jointly learn the necessary\nquantity for estimating the mutual information and the dynamics model,\nproviding a general framework for incorporating different forms of outcomes of\ninterest. Integrated into a policy iteration scheme, our approach guarantees\nconvergence to the optimal policy. While we mainly focus on theoretical\nanalysis, our approach opens the possibilities of leveraging intrinsic control\nwith model learning to enhance sample efficiency and incorporate uncertainty of\nthe environment into decision-making.\n","authors":["Jianfei Ma"],"pdf_url":"https://arxiv.org/pdf/2310.08100v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08096v1","updated":"2023-10-12T07:43:27Z","published":"2023-10-12T07:43:27Z","title":"ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction\n Targets","summary":" Public and private actors struggle to assess the vast amounts of information\nabout sustainability commitments made by various institutions. To address this\nproblem, we create a novel tool for automatically detecting corporate,\nnational, and regional net zero and reduction targets in three steps. First, we\nintroduce an expert-annotated data set with 3.5K text samples. Second, we train\nand release ClimateBERT-NetZero, a natural language classifier to detect\nwhether a text contains a net zero or reduction target. Third, we showcase its\nanalysis potential with two use cases: We first demonstrate how\nClimateBERT-NetZero can be combined with conventional question-answering (Q&A)\nmodels to analyze the ambitions displayed in net zero and reduction targets.\nFurthermore, we employ the ClimateBERT-NetZero model on quarterly earning call\ntranscripts and outline how communication patterns evolve over time. Our\nexperiments demonstrate promising pathways for extracting and analyzing net\nzero and emission reduction targets at scale.\n","authors":["Tobias Schimanski","Julia Bingler","Camilla Hyslop","Mathias Kraus","Markus Leippold"],"pdf_url":"https://arxiv.org/pdf/2310.08096v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08091v1","updated":"2023-10-12T07:38:10Z","published":"2023-10-12T07:38:10Z","title":"Discerning Temporal Difference Learning","summary":" Temporal difference learning (TD) is a foundational concept in reinforcement\nlearning (RL), aimed at efficiently assessing a policy's value function.\nTD($\\lambda$), a potent variant, incorporates a memory trace to distribute the\nprediction error into the historical context. 
However, this approach often\nneglects the significance of historical states and the relative importance of\npropagating the TD error, influenced by challenges such as visitation imbalance\nor outcome noise. To address this, we propose a novel TD algorithm named\ndiscerning TD learning (DTD), which allows flexible emphasis\nfunctions$-$predetermined or adapted during training$-$to allocate efforts\neffectively across states. We establish the convergence properties of our\nmethod within a specific class of emphasis functions and showcase its promising\npotential for adaptation to deep RL contexts. Empirical results underscore that\nemploying a judicious emphasis function not only improves value estimation but\nalso expedites learning across diverse scenarios.\n","authors":["Jianfei Ma"],"pdf_url":"https://arxiv.org/pdf/2310.08091v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08088v1","updated":"2023-10-12T07:26:41Z","published":"2023-10-12T07:26:41Z","title":"Dealing with zero-inflated data: achieving SOTA with a two-fold machine\n learning approach","summary":" In many cases, a machine learning model must learn to correctly predict a few\ndata points with particular values of interest in a broader range of data where\nmany target values are zero. Zero-inflated data can be found in diverse\nscenarios, such as lumpy and intermittent demands, power consumption for home\nappliances being turned on and off, impurities measurement in distillation\nprocesses, and even airport shuttle demand prediction. The presence of zeroes\naffects the models' learning and may result in poor performance. Furthermore,\nzeroes also distort the metrics used to compute the model's prediction quality.\nThis paper showcases two real-world use cases (home appliances classification\nand airport shuttle demand prediction) where a hierarchical model applied in\nthe context of zero-inflated data leads to excellent results. In particular,\nfor home appliances classification, the weighted average of Precision, Recall,\nF1, and AUC ROC was increased by 27%, 34%, 49%, and 27%, respectively.\nFurthermore, it is estimated that the proposed approach is also four times more\nenergy efficient than the SOTA approach against which it was compared to.\nTwo-fold models performed best in all cases when predicting airport shuttle\ndemand, and the difference against other models has been proven to be\nstatistically significant.\n","authors":["Jože M. Rožanec","Gašper Petelin","João Costa","Blaž Bertalanič","Gregor Cerar","Marko Guček","Gregor Papa","Dunja Mladenić"],"pdf_url":"https://arxiv.org/pdf/2310.08088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08087v1","updated":"2023-10-12T07:20:03Z","published":"2023-10-12T07:20:03Z","title":"A Carbon Tracking Model for Federated Learning: Impact of Quantization\n and Sparsification","summary":" Federated Learning (FL) methods adopt efficient communication technologies to\ndistribute machine learning tasks across edge devices, reducing the overhead in\nterms of data storage and computational complexity compared to centralized\nsolutions. Rather than moving large data volumes from producers (sensors,\nmachines) to energy-hungry data centers, raising environmental concerns due to\nresource demands, FL provides an alternative solution to mitigate the energy\ndemands of several learning tasks while enabling new Artificial Intelligence of\nThings (AIoT) applications. 
This paper proposes a framework for real-time\nmonitoring of the energy and carbon footprint impacts of FL systems. The carbon\ntracking tool is evaluated for consensus (fully decentralized) and classical FL\npolicies. For the first time, we present a quantitative evaluation of different\ncomputationally and communication efficient FL methods from the perspectives of\nenergy consumption and carbon equivalent emissions, suggesting also general\nguidelines for energy-efficient design. Results indicate that consensus-driven\nFL implementations should be preferred for limiting carbon emissions when the\nenergy efficiency of the communication is low (i.e., < 25 Kbit/Joule). Besides,\nquantization and sparsification operations are shown to strike a balance\nbetween learning performances and energy consumption, leading to sustainable FL\ndesigns.\n","authors":["Luca Barbieri","Stefano Savazzi","Sanaz Kianoush","Monica Nicoli","Luigi Serio"],"pdf_url":"https://arxiv.org/pdf/2310.08087v1.pdf","comment":"accepted for presentation at IEEE CAMAD 2023"},{"id":"http://arxiv.org/abs/2307.03486v2","updated":"2023-10-12T07:13:32Z","published":"2023-07-07T09:47:15Z","title":"Discovering Hierarchical Achievements in Reinforcement Learning via\n Contrastive Learning","summary":" Discovering achievements with a hierarchical structure in procedurally\ngenerated environments presents a significant challenge. This requires an agent\nto possess a broad range of abilities, including generalization and long-term\nreasoning. Many prior methods have been built upon model-based or hierarchical\napproaches, with the belief that an explicit module for long-term planning\nwould be advantageous for learning hierarchical dependencies. However, these\nmethods demand an excessive number of environment interactions or large model\nsizes, limiting their practicality. In this work, we demonstrate that proximal\npolicy optimization (PPO), a simple yet versatile model-free algorithm,\noutperforms previous methods when optimized with recent implementation\npractices. Moreover, we find that the PPO agent can predict the next\nachievement to be unlocked to some extent, albeit with limited confidence.\nBased on this observation, we introduce a novel contrastive learning method,\ncalled achievement distillation, which strengthens the agent's ability to\npredict the next achievement. Our method exhibits a strong capacity for\ndiscovering hierarchical achievements and shows state-of-the-art performance on\nthe challenging Crafter environment in a sample-efficient manner while\nutilizing fewer model parameters.\n","authors":["Seungyong Moon","Junyoung Yeom","Bumsoo Park","Hyun Oh Song"],"pdf_url":"https://arxiv.org/pdf/2307.03486v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08078v1","updated":"2023-10-12T06:59:10Z","published":"2023-10-12T06:59:10Z","title":"To token or not to token: A Comparative Study of Text Representations\n for Cross-Lingual Transfer","summary":" Choosing an appropriate tokenization scheme is often a bottleneck in\nlow-resource cross-lingual transfer. To understand the downstream implications\nof text representation choices, we perform a comparative analysis on language\nmodels having diverse text representation modalities including 2\nsegmentation-based models (\\texttt{BERT}, \\texttt{mBERT}), 1 image-based model\n(\\texttt{PIXEL}), and 1 character-level model (\\texttt{CANINE}). 
First, we\npropose a scoring Language Quotient (LQ) metric capable of providing a weighted\nrepresentation of both zero-shot and few-shot evaluation combined. Utilizing\nthis metric, we perform experiments comprising 19 source languages and 133\ntarget languages on three tasks (POS tagging, Dependency parsing, and NER). Our\nanalysis reveals that image-based models excel in cross-lingual transfer when\nlanguages are closely related and share visually similar scripts. However, for\ntasks biased toward word meaning (POS, NER), segmentation-based models prove to\nbe superior. Furthermore, in dependency parsing tasks where word relationships\nplay a crucial role, models with their character-level focus, outperform\nothers. Finally, we propose a recommendation scheme based on our findings to\nguide model selection according to task and language requirements.\n","authors":["Md Mushfiqur Rahman","Fardin Ahsan Sakib","Fahim Faisal","Antonios Anastasopoulos"],"pdf_url":"https://arxiv.org/pdf/2310.08078v1.pdf","comment":"Accepted at 3RD MULTILINGUAL REPRESENTATION LEARNING (MRL) WORKSHOP,\n 2023"},{"id":"http://arxiv.org/abs/2310.07312v2","updated":"2023-10-12T06:57:51Z","published":"2023-10-11T08:57:59Z","title":"WiGenAI: The Symphony of Wireless and Generative AI via Diffusion Models","summary":" Innovative foundation models, such as GPT-3 and stable diffusion models, have\nmade a paradigm shift in the realm of artificial intelligence (AI) towards\ngenerative AI-based systems. In unison, from data communication and networking\nperspective, AI and machine learning (AI/ML) algorithms are envisioned to be\npervasively incorporated into the future generations of wireless communications\nsystems, highlighting the need for novel AI-native solutions for the emergent\ncommunication scenarios. In this article, we outline the applications of\ngenerative AI in wireless communication systems to lay the foundations for\nresearch in this field. Diffusion-based generative models, as the new\nstate-of-the-art paradigm of generative models, are introduced, and their\napplications in wireless communication systems are discussed. Two case studies\nare also presented to showcase how diffusion models can be exploited for the\ndevelopment of resilient AI-native communication systems. Specifically, we\npropose denoising diffusion probabilistic models (DDPM) for a wireless\ncommunication scheme with non-ideal transceivers, where 30% improvement is\nachieved in terms of bit error rate. As the second application, DDPMs are\nemployed at the transmitter to shape the constellation symbols, highlighting a\nrobust out-of-distribution performance. Finally, future directions and open\nissues for the development of generative AI-based wireless systems are\ndiscussed to promote future research endeavors towards wireless generative AI\n(WiGenAI).\n","authors":["Mehdi Letafati","Samad Ali","Matti Latva-aho"],"pdf_url":"https://arxiv.org/pdf/2310.07312v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08073v1","updated":"2023-10-12T06:50:43Z","published":"2023-10-12T06:50:43Z","title":"Samples on Thin Ice: Re-Evaluating Adversarial Pruning of Neural\n Networks","summary":" Neural network pruning has shown to be an effective technique for reducing\nthe network size, trading desirable properties like generalization and\nrobustness to adversarial attacks for higher sparsity. Recent work has claimed\nthat adversarial pruning methods can produce sparse networks while also\npreserving robustness to adversarial examples. 
In this work, we first\nre-evaluate three state-of-the-art adversarial pruning methods, showing that\ntheir robustness was indeed overestimated. We then compare pruned and dense\nversions of the same models, discovering that samples on thin ice, i.e., closer\nto the unpruned model's decision boundary, are typically misclassified after\npruning. We conclude by discussing how this intuition may lead to designing\nmore effective adversarial pruning methods in future work.\n","authors":["Giorgio Piras","Maura Pintor","Ambra Demontis","Battista Biggio"],"pdf_url":"https://arxiv.org/pdf/2310.08073v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.02010v2","updated":"2023-10-12T06:46:25Z","published":"2023-06-03T05:45:29Z","title":"Memorization Capacity of Multi-Head Attention in Transformers","summary":" Transformers have become the go-to architecture for language and vision\ntasks, yet their theoretical properties, especially memorization capacity,\nremain elusive. This paper investigates the memorization abilities of\nmulti-head attention mechanisms, examining how many example sequences they can\nmemorize, as a function of the number of heads and sequence length. Motivated\nby experimental findings on vision transformers, we introduce novel assumptions\nabout the linear independence of input data, distinct from the commonly used\ngeneral-position assumption. Under these assumptions, we demonstrate that an\nattention layer with $H$ heads, dimension $d$, and context size $n < d$,\nfeaturing $\\Theta(Hd^2)$ parameters, can memorize $\\Omega(Hn)$ examples. Our\nanalysis sheds light on how different attention heads handle various example\nsequences, aided by the softmax operator's saturation property. We validate our\nfindings through experiments on synthetic data.\n","authors":["Sadegh Mahdavi","Renjie Liao","Christos Thrampoulidis"],"pdf_url":"https://arxiv.org/pdf/2306.02010v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08071v1","updated":"2023-10-12T06:36:41Z","published":"2023-10-12T06:36:41Z","title":"Learning Transferable Conceptual Prototypes for Interpretable\n Unsupervised Domain Adaptation","summary":" Despite the great progress of unsupervised domain adaptation (UDA) with the\ndeep neural networks, current UDA models are opaque and cannot provide\npromising explanations, limiting their applications in the scenarios that\nrequire safe and controllable model decisions. At present, a surge of work\nfocuses on designing deep interpretable methods with adequate data annotations\nand only a few methods consider the distributional shift problem. Most existing\ninterpretable UDA methods are post-hoc ones, which cannot facilitate the model\nlearning process for performance enhancement. In this paper, we propose an\ninherently interpretable method, named Transferable Conceptual Prototype\nLearning (TCPL), which could simultaneously interpret and improve the processes\nof knowledge transfer and decision-making in UDA. To achieve this goal, we\ndesign a hierarchically prototypical module that transfers categorical basic\nconcepts from the source domain to the target domain and learns domain-shared\nprototypes for explaining the underlying reasoning process. With the learned\ntransferable prototypes, a self-predictive consistent pseudo-label strategy\nthat fuses confidence, predictions, and prototype information, is designed for\nselecting suitable target samples for pseudo annotations and gradually\nnarrowing down the domain gap. 
Comprehensive experiments show that the proposed\nmethod can not only provide effective and intuitive explanations but also\noutperform previous state-of-the-arts.\n","authors":["Junyu Gao","Xinhong Ma","Changsheng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.08071v1.pdf","comment":"Submitted to IEEE TIP"},{"id":"http://arxiv.org/abs/2310.08070v1","updated":"2023-10-12T06:36:31Z","published":"2023-10-12T06:36:31Z","title":"Tight Time-Space Lower Bounds for Constant-Pass Learning","summary":" In his breakthrough paper, Raz showed that any parity learning algorithm\nrequires either quadratic memory or an exponential number of samples [FOCS'16,\nJACM'19]. A line of work that followed extended this result to a large class of\nlearning problems. Until recently, all these results considered learning in the\nstreaming model, where each sample is drawn independently, and the learner is\nallowed a single pass over the stream of samples. Garg, Raz, and Tal [CCC'19]\nconsidered a stronger model, allowing multiple passes over the stream. In the\n$2$-pass model, they showed that learning parities of size $n$ requires either\na memory of size $n^{1.5}$ or at least $2^{\\sqrt{n}}$ samples. (Their result\nalso generalizes to other learning problems.)\n In this work, for any constant $q$, we prove tight memory-sample lower bounds\nfor any parity learning algorithm that makes $q$ passes over the stream of\nsamples. We show that such a learner requires either $\\Omega(n^{2})$ memory\nsize or at least $2^{\\Omega(n)}$ samples. Beyond establishing a tight lower\nbound, this is the first non-trivial lower bound for $q$-pass learning for any\n$q\\ge 3$. Similar to prior work, our results extend to any learning problem\nwith many nearly-orthogonal concepts.\n We complement the lower bound with an upper bound, showing that parity\nlearning with $q$ passes can be done efficiently with $O(n^2/\\log q)$ memory.\n","authors":["Xin Lyu","Avishay Tal","Hongxun Wu","Junzhao Yang"],"pdf_url":"https://arxiv.org/pdf/2310.08070v1.pdf","comment":"To appear at FOCS 2023"},{"id":"http://arxiv.org/abs/2310.08069v1","updated":"2023-10-12T06:32:42Z","published":"2023-10-12T06:32:42Z","title":"Rethinking Negative Pairs in Code Search","summary":" Recently, contrastive learning has become a key component in fine-tuning code\nsearch models for software development efficiency and effectiveness. It pulls\ntogether positive code snippets while pushing negative samples away given\nsearch queries. Among contrastive learning, InfoNCE is the most widely used\nloss function due to its better performance. However, the following problems in\nnegative samples of InfoNCE may deteriorate its representation learning: 1) The\nexistence of false negative samples in large code corpora due to duplications.\n2). The failure to explicitly differentiate between the potential relevance of\nnegative samples. As an example, a bubble sorting algorithm example is less\n``negative'' than a file saving function for the quick sorting algorithm query.\nIn this paper, we tackle the above problems by proposing a simple yet effective\nSoft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss\nfunction, we apply three methods to estimate the weights of negative pairs and\nshow that the vanilla InfoNCE loss is a special case of Soft-InfoNCE.\nTheoretically, we analyze the effects of Soft-InfoNCE on controlling the\ndistribution of learnt code representations and on deducing a more precise\nmutual information estimation. 
We furthermore discuss the superiority of\nproposed loss functions with other design alternatives. Extensive experiments\ndemonstrate the effectiveness of Soft-InfoNCE and weights estimation methods\nunder state-of-the-art code search models on a large-scale public dataset\nconsisting of six programming languages. Source code is available at\n\\url{https://github.com/Alex-HaochenLi/Soft-InfoNCE}.\n","authors":["Haochen Li","Xin Zhou","Luu Anh Tuan","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2310.08069v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2309.03084v2","updated":"2023-10-12T06:24:33Z","published":"2023-09-04T09:16:49Z","title":"Pure Monte Carlo Counterfactual Regret Minimization","summary":" Counterfactual Regret Minimization (CFR) and its variants are the best\nalgorithms so far for solving large-scale incomplete information games.\nHowever, we believe that there are two problems with CFR: First, matrix\nmultiplication is required in CFR iteration, and the time complexity of one\niteration is too high; Secondly, the game characteristics in the real world are\ndifferent. Just using one CFR algorithm will not be perfectly suitable for all\ngame problems.\n For these two problems, this paper proposes a new algorithm called Pure CFR\n(PCFR) based on CFR. PCFR can be seen as a combination of CFR and Fictitious\nPlay (FP), inheriting the concept of counterfactual regret (value) from CFR,\nand using the best response strategy instead of the regret matching strategy\nfor the next iteration. This algorithm has three advantages. First, PCFR can be\ncombined with any CFR variant. The resulting Pure MCCFR (PMCCFR) can\nsignificantly reduce the time and space complexity of one iteration. Secondly,\nour experiments show that the convergence speed of the PMCCFR is 2$\\sim$3 times\nthat of the MCCFR. Finally, there is a type of game that is very suitable for\nPCFR, we call this type of game clear-game, which is characterized by a high\nproportion of dominated strategies. Experiments show that in clear-game, the\nconvergence rate of PMCCFR is two orders of magnitude higher than that of\nMCCFR.\n","authors":["Ju Qi","Ting Feng","Falun Hei","Zhemei Fang","Yunfeng Luo"],"pdf_url":"https://arxiv.org/pdf/2309.03084v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08061v1","updated":"2023-10-12T06:23:12Z","published":"2023-10-12T06:23:12Z","title":"ETDock: A Novel Equivariant Transformer for Protein-Ligand Docking","summary":" Predicting the docking between proteins and ligands is a crucial and\nchallenging task for drug discovery. However, traditional docking methods\nmainly rely on scoring functions, and deep learning-based docking approaches\nusually neglect the 3D spatial information of proteins and ligands, as well as\nthe graph-level features of ligands, which limits their performance. To address\nthese limitations, we propose an equivariant transformer neural network for\nprotein-ligand docking pose prediction. Our approach involves the fusion of\nligand graph-level features by feature processing, followed by the learning of\nligand and protein representations using our proposed TAMformer module.\nAdditionally, we employ an iterative optimization approach based on the\npredicted distance matrix to generate refined ligand poses. 
The experimental\nresults on real datasets show that our model can achieve state-of-the-art\nperformance.\n","authors":["Yiqiang Yi","Xu Wan","Yatao Bian","Le Ou-Yang","Peilin Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08061v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08056v1","updated":"2023-10-12T06:09:26Z","published":"2023-10-12T06:09:26Z","title":"Learning from Label Proportions: Bootstrapping Supervised Learners via\n Belief Propagation","summary":" Learning from Label Proportions (LLP) is a learning problem where only\naggregate level labels are available for groups of instances, called bags,\nduring training, and the aim is to get the best performance at the\ninstance-level on the test data. This setting arises in domains like\nadvertising and medicine due to privacy considerations. We propose a novel\nalgorithmic framework for this problem that iteratively performs two main\nsteps. For the first step (Pseudo Labeling) in every iteration, we define a\nGibbs distribution over binary instance labels that incorporates a) covariate\ninformation through the constraint that instances with similar covariates\nshould have similar labels and b) the bag level aggregated label. We then use\nBelief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo\nlabels. In the second step (Embedding Refinement), we use the pseudo labels to\nprovide supervision for a learner that yields a better embedding. Further, we\niterate on the two steps again by using the second step's embeddings as new\ncovariates for the next iteration. In the final iteration, a classifier is\ntrained using the pseudo labels. Our algorithm displays strong gains against\nseveral SOTA baselines (up to 15%) for the LLP Binary Classification problem on\nvarious dataset types - tabular and Image. We achieve these improvements with\nminimal computational overhead above standard supervised learning due to Belief\nPropagation, for large bag sizes, even for a million samples.\n","authors":["Shreyas Havaldar","Navodita Sharma","Shubhi Sareen","Karthikeyan Shanmugam","Aravindan Raghuveer"],"pdf_url":"https://arxiv.org/pdf/2310.08056v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07365v2","updated":"2023-10-12T06:00:52Z","published":"2023-10-11T10:30:49Z","title":"GraphControl: Adding Conditional Control to Universal Graph Pre-trained\n Models for Graph Domain Transfer Learning","summary":" Graph-structured data is ubiquitous in the world which models complex\nrelationships between objects, enabling various Web applications. Daily\ninfluxes of unlabeled graph data on the Web offer immense potential for these\napplications. Graph self-supervised algorithms have achieved significant\nsuccess in acquiring generic knowledge from abundant unlabeled graph data.\nThese pre-trained models can be applied to various downstream Web applications,\nsaving training time and improving downstream (target) performance. However,\ndifferent graphs, even across seemingly similar domains, can differ\nsignificantly in terms of attribute semantics, posing difficulties, if not\ninfeasibility, for transferring the pre-trained models to downstream tasks.\nConcretely speaking, for example, the additional task-specific node information\nin downstream tasks (specificity) is usually deliberately omitted so that the\npre-trained representation (transferability) can be leveraged. The trade-off as\nsuch is termed as \"transferability-specificity dilemma\" in this work. 
To\naddress this challenge, we introduce an innovative deployment module coined as\nGraphControl, motivated by ControlNet, to realize better graph domain transfer\nlearning. Specifically, by leveraging universal structural pre-trained models\nand GraphControl, we align the input space across various graphs and\nincorporate unique characteristics of target data as conditional inputs. These\nconditions will be progressively integrated into the model during fine-tuning\nor prompt tuning through ControlNet, facilitating personalized deployment.\nExtensive experiments show that our method significantly enhances the\nadaptability of pre-trained models on target attributed datasets, achieving\n1.4-3x performance gain. Furthermore, it outperforms training-from-scratch\nmethods on target data with a comparable margin and exhibits faster\nconvergence.\n","authors":["Yun Zhu","Yaoke Wang","Haizhou Shi","Zhenshuo Zhang","Siliang Tang"],"pdf_url":"https://arxiv.org/pdf/2310.07365v2.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2310.08051v1","updated":"2023-10-12T05:52:54Z","published":"2023-10-12T05:52:54Z","title":"LGL-BCI: A Lightweight Geometric Learning Framework for Motor\n Imagery-Based Brain-Computer Interfaces","summary":" Brain-Computer Interfaces (BCIs) are a groundbreaking technology for\ninteracting with external devices using brain signals. Despite advancements,\nelectroencephalogram (EEG)-based Motor Imagery (MI) tasks face challenges like\namplitude and phase variability, and complex spatial correlations, with a need\nfor smaller model size and faster inference. This study introduces the LGL-BCI\nframework, employing a Geometric Deep Learning Framework for EEG processing in\nnon-Euclidean metric spaces, particularly the Symmetric Positive Definite (SPD)\nManifold space. LGL-BCI offers robust EEG data representation and captures\nspatial correlations. We propose an EEG channel selection solution via a\nfeature decomposition algorithm to reduce SPD matrix dimensionality, with a\nlossless transformation boosting inference speed. Extensive experiments show\nLGL-BCI's superior accuracy and efficiency compared to current solutions,\nhighlighting geometric deep learning's potential in MI-BCI applications. The\nefficiency, assessed on two public EEG datasets and two real-world EEG devices,\nsignificantly outperforms the state-of-the-art solution in accuracy ($82.54\\%$\nversus $62.22\\%$) with fewer parameters (64.9M compared to 183.7M).\n","authors":["Jianchao Lu","Yuzhe Tian","Yang Zhang","Jiaqi Ge","Quan Z. Sheng","Xi Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.08051v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06599v4","updated":"2023-10-12T05:51:30Z","published":"2023-06-11T06:27:06Z","title":"Variational Imbalanced Regression: Fair Uncertainty Quantification via\n Probabilistic Smoothing","summary":" Existing regression models tend to fall short in both accuracy and\nuncertainty estimation when the label distribution is imbalanced. In this\npaper, we propose a probabilistic deep learning model, dubbed variational\nimbalanced regression (VIR), which not only performs well in imbalanced\nregression but naturally produces reasonable uncertainty estimation as a\nbyproduct. 
Different from typical variational autoencoders assuming I.I.D.\nrepresentations (a data point's representation is not directly affected by\nother data points), our VIR borrows data with similar regression labels to\ncompute the latent representation's variational distribution; furthermore,\ndifferent from deterministic regression models producing point estimates, VIR\npredicts the entire normal-inverse-gamma distributions and modulates the\nassociated conjugate distributions to impose probabilistic reweighting on the\nimbalanced data, thereby providing better uncertainty estimation. Experiments\nin several real-world datasets show that our VIR can outperform\nstate-of-the-art imbalanced regression models in terms of both accuracy and\nuncertainty estimation. Code will soon be available at\n\\url{https://github.com/Wang-ML-Lab/variational-imbalanced-regression}.\n","authors":["Ziyan Wang","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2306.06599v4.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08049v1","updated":"2023-10-12T05:43:06Z","published":"2023-10-12T05:43:06Z","title":"Exploring the Relationship Between Model Architecture and In-Context\n Learning Ability","summary":" What is the relationship between model architecture and the ability to\nperform in-context learning? In this empirical study, we take the first steps\ntowards answering this question. In particular, we evaluate fifteen model\narchitectures across a suite of synthetic in-context learning tasks. The\nselected architectures represent a broad range of paradigms, including\nrecurrent and convolution-based neural networks, transformers, and emerging\nattention alternatives. We discover that all considered architectures can\nperform in-context learning under certain conditions. However, contemporary\narchitectures are found to be the best performing, especially as task\ncomplexity grows. Additionally, our follow-up experiments delve into various\nfactors that influence in-context learning. We observe varied sensitivities\namong architectures with respect to hyperparameter settings. Our study of\ntraining dynamics reveals that certain architectures exhibit a smooth,\nprogressive learning trajectory, while others demonstrate periods of stagnation\nfollowed by abrupt mastery of the task. Finally, and somewhat surprisingly, we\nfind that several emerging attention alternatives are more robust in-context\nlearners than transformers; since such approaches have constant-sized memory\nfootprints at inference time, this result opens the future possibility of\nscaling up in-context learning to vastly larger numbers of in-context examples.\n","authors":["Ivan Lee","Nan Jiang","Taylor Berg-Kirkpatrick"],"pdf_url":"https://arxiv.org/pdf/2310.08049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05624v2","updated":"2023-10-12T05:33:19Z","published":"2023-10-09T11:26:58Z","title":"Locality-Aware Generalizable Implicit Neural Representation","summary":" Generalizable implicit neural representation (INR) enables a single\ncontinuous function, i.e., a coordinate-based neural network, to represent\nmultiple data instances by modulating its weights or intermediate features\nusing latent codes. However, the expressive power of the state-of-the-art\nmodulation is limited due to its inability to localize and capture fine-grained\ndetails of data entities such as specific pixels and rays. 
To address this\nissue, we propose a novel framework for generalizable INR that combines a\ntransformer encoder with a locality-aware INR decoder. The transformer encoder\npredicts a set of latent tokens from a data instance to encode local\ninformation into each latent token. The locality-aware INR decoder extracts a\nmodulation vector by selectively aggregating the latent tokens via\ncross-attention for a coordinate input and then predicts the output by\nprogressively decoding with coarse-to-fine modulation through multiple\nfrequency bandwidths. The selective token aggregation and the multi-band\nfeature modulation enable us to learn locality-aware representation in spatial\nand spectral aspects, respectively. Our framework significantly outperforms\nprevious generalizable INRs and validates the usefulness of the locality-aware\nlatents for downstream tasks such as image generation.\n","authors":["Doyup Lee","Chiheon Kim","Minsu Cho","Wook-Shin Han"],"pdf_url":"https://arxiv.org/pdf/2310.05624v2.pdf","comment":"19 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.08041v1","updated":"2023-10-12T05:25:49Z","published":"2023-10-12T05:25:49Z","title":"QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large\n Language Models","summary":" Large Language Models (LLMs) excel in NLP, but their demands hinder their\nwidespread deployment. While Quantization-Aware Training (QAT) offers a\nsolution, its extensive training costs make Post-Training Quantization (PTQ) a\nmore practical approach for LLMs. In existing studies, activation outliers in\nparticular channels are identified as the bottleneck to PTQ accuracy. They\npropose to transform the magnitudes from activations to weights, which however\noffers limited alleviation or suffers from unstable gradients, resulting in a\nsevere performance drop at low-bitwidth. In this paper, we propose QLLM, an\naccurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM\nintroduces an adaptive channel reassembly technique that reallocates the\nmagnitude of outliers to other channels, thereby mitigating their impact on the\nquantization range. This is achieved by channel disassembly and channel\nassembly, which first breaks down the outlier channels into several\nsub-channels to ensure a more balanced distribution of activation magnitudes.\nThen similar channels are merged to maintain the original channel number for\nefficiency. Additionally, an adaptive strategy is designed to autonomously\ndetermine the optimal number of sub-channels for channel disassembly. To\nfurther compensate for the performance loss caused by quantization, we propose\nan efficient tuning method that only learns a small number of low-rank weights\nwhile freezing the pre-trained quantized model. After training, these low-rank\nparameters can be fused into the frozen weights without affecting inference.\nExtensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate\nquantized models efficiently. 
For example, QLLM quantizes the 4-bit LLaMA-2-70B\nwithin 10 hours on a single A100-80G GPU, outperforming the previous\nstate-of-the-art method by 7.89% on the average accuracy across five zero-shot\ntasks.\n","authors":["Jing Liu","Ruihao Gong","Xiuying Wei","Zhiwei Dong","Jianfei Cai","Bohan Zhuang"],"pdf_url":"https://arxiv.org/pdf/2310.08041v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.10902v2","updated":"2023-10-12T05:21:21Z","published":"2023-01-26T02:22:46Z","title":"Efficient Hyperdimensional Computing","summary":" Hyperdimensional computing (HDC) is a method to perform classification that\nuses binary vectors with high dimensions and the majority rule. This approach\nhas the potential to be energy-efficient and hence deemed suitable for\nresource-limited platforms due to its simplicity and massive parallelism.\nHowever, in order to achieve high accuracy, HDC sometimes uses hypervectors\nwith tens of thousands of dimensions. This potentially negates its efficiency\nadvantage. In this paper, we examine the necessity of such high dimensions and\nconduct a detailed theoretical analysis of the relationship between hypervector\ndimensions and accuracy. Our results demonstrate that as the dimension of the\nhypervectors increases, the worst-case/average-case HDC prediction accuracy\nwith the majority rule decreases. Building on this insight, we develop HDC\nmodels that use binary hypervectors with dimensions orders of magnitude lower\nthan those of state-of-the-art HDC models while maintaining equivalent or even\nimproved accuracy and efficiency. For instance, on the MNIST dataset, we\nachieve 91.12% HDC accuracy in image classification with a dimension of only\n64. Our methods perform operations that are only 0.35% of other HDC models with\ndimensions of 10,000. Furthermore, we evaluate our methods on ISOLET, UCI-HAR,\nand Fashion-MNIST datasets and investigate the limits of HDC computing.\n","authors":["Zhanglu Yan","Shida Wang","Kaiwen Tang","Weng-Fai Wong"],"pdf_url":"https://arxiv.org/pdf/2301.10902v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08040v1","updated":"2023-10-12T05:20:18Z","published":"2023-10-12T05:20:18Z","title":"SEE-OoD: Supervised Exploration For Enhanced Out-of-Distribution\n Detection","summary":" Current techniques for Out-of-Distribution (OoD) detection predominantly rely\non quantifying predictive uncertainty and incorporating model regularization\nduring the training phase, using either real or synthetic OoD samples. However,\nmethods that utilize real OoD samples lack exploration and are prone to overfit\nthe OoD samples at hand. Whereas synthetic samples are often generated based on\nfeatures extracted from training data, rendering them less effective when the\ntraining and OoD data are highly overlapped in the feature space. In this work,\nwe propose a Wasserstein-score-based generative adversarial training scheme to\nenhance OoD detection accuracy, which, for the first time, performs data\naugmentation and exploration simultaneously under the supervision of limited\nOoD samples. Specifically, the generator explores OoD spaces and generates\nsynthetic OoD samples using feedback from the discriminator, while the\ndiscriminator exploits both the observed and synthesized samples for OoD\ndetection using a predefined Wasserstein score. We provide theoretical\nguarantees that the optimal solutions of our generative scheme are\nstatistically achievable through adversarial training in empirical settings. 
We\nthen demonstrate that the proposed method outperforms state-of-the-art\ntechniques on various computer vision datasets and exhibits superior\ngeneralizability to unseen OoD data.\n","authors":["Xiaoyang Song","Wenbo Sun","Maher Nouiehed","Raed Al Kontar","Judy Jin"],"pdf_url":"https://arxiv.org/pdf/2310.08040v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07297v2","updated":"2023-10-12T05:15:51Z","published":"2023-10-11T08:31:26Z","title":"Score Regularized Policy Optimization through Diffusion Behavior","summary":" Recent developments in offline reinforcement learning have uncovered the\nimmense potential of diffusion modeling, which excels at representing\nheterogeneous behavior policies. However, sampling from diffusion policies is\nconsiderably slow because it necessitates tens to hundreds of iterative\ninference steps for one action. To address this issue, we propose to extract an\nefficient deterministic inference policy from critic models and pretrained\ndiffusion behavior models, leveraging the latter to directly regularize the\npolicy gradient with the behavior distribution's score function during\noptimization. Our method enjoys powerful generative capabilities of diffusion\nmodeling while completely circumventing the computationally intensive and\ntime-consuming diffusion sampling scheme, both during training and evaluation.\nExtensive results on D4RL tasks show that our method boosts action sampling\nspeed by more than 25 times compared with various leading diffusion-based\nmethods in locomotion tasks, while still maintaining state-of-the-art\nperformance.\n","authors":["Huayu Chen","Cheng Lu","Zhengyi Wang","Hang Su","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.07297v2.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2310.08039v1","updated":"2023-10-12T05:14:42Z","published":"2023-10-12T05:14:42Z","title":"Rethinking Large-scale Pre-ranking System: Entire-chain Cross-domain\n Models","summary":" Industrial systems such as recommender systems and online advertising, have\nbeen widely equipped with multi-stage architectures, which are divided into\nseveral cascaded modules, including matching, pre-ranking, ranking and\nre-ranking. As a critical bridge between matching and ranking, existing\npre-ranking approaches mainly endure sample selection bias (SSB) problem owing\nto ignoring the entire-chain data dependence, resulting in sub-optimal\nperformances. In this paper, we rethink pre-ranking system from the perspective\nof the entire sample space, and propose Entire-chain Cross-domain Models (ECM),\nwhich leverage samples from the whole cascaded stages to effectively alleviate\nSSB problem. Besides, we design a fine-grained neural structure named ECMM to\nfurther improve the pre-ranking accuracy. Specifically, we propose a\ncross-domain multi-tower neural network to comprehensively predict for each\nstage result, and introduce the sub-networking routing strategy with $L0$\nregularization to reduce computational costs. 
Evaluations on real-world\nlarge-scale traffic logs demonstrate that our pre-ranking models outperform\nSOTA methods while time consumption is maintained within an acceptable level,\nwhich achieves better trade-off between efficiency and effectiveness.\n","authors":["Jinbo Song","Ruoran Huang","Xinyang Wang","Wei Huang","Qian Yu","Mingming Chen","Yafei Yao","Chaosheng Fan","Changping Peng","Zhangang Lin","Jinghe Hu","Jingping Shao"],"pdf_url":"https://arxiv.org/pdf/2310.08039v1.pdf","comment":"5 pages, 2 figures"},{"id":"http://arxiv.org/abs/2310.08038v1","updated":"2023-10-12T05:09:27Z","published":"2023-10-12T05:09:27Z","title":"Continual Learning via Manifold Expansion Replay","summary":" In continual learning, the learner learns multiple tasks in sequence, with\ndata being acquired only once for each task. Catastrophic forgetting is a major\nchallenge to continual learning. To reduce forgetting, some existing\nrehearsal-based methods use episodic memory to replay samples of previous\ntasks. However, in the process of knowledge integration when learning a new\ntask, this strategy also suffers from catastrophic forgetting due to an\nimbalance between old and new knowledge. To address this problem, we propose a\nnovel replay strategy called Manifold Expansion Replay (MaER). We argue that\nexpanding the implicit manifold of the knowledge representation in the episodic\nmemory helps to improve the robustness and expressiveness of the model. To this\nend, we propose a greedy strategy to keep increasing the diameter of the\nimplicit manifold represented by the knowledge in the buffer during memory\nmanagement. In addition, we introduce Wasserstein distance instead of cross\nentropy as distillation loss to preserve previous knowledge. With extensive\nexperimental validation on MNIST, CIFAR10, CIFAR100, and TinyImageNet, we show\nthat the proposed method significantly improves the accuracy in continual\nlearning setup, outperforming the state of the arts.\n","authors":["Zihao Xu","Xuan Tang","Yufei Shi","Jianfeng Zhang","Jian Yang","Mingsong Chen","Xian Wei"],"pdf_url":"https://arxiv.org/pdf/2310.08038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08036v1","updated":"2023-10-12T05:08:21Z","published":"2023-10-12T05:08:21Z","title":"ZEST: Attention-based Zero-Shot Learning for Unseen IoT Device\n Classification","summary":" Recent research works have proposed machine learning models for classifying\nIoT devices connected to a network. However, there is still a practical\nchallenge of not having all devices (and hence their traffic) available during\nthe training of a model. This essentially means, during the operational phase,\nwe need to classify new devices not seen during the training phase. To address\nthis challenge, we propose ZEST -- a ZSL (zero-shot learning) framework based\non self-attention for classifying both seen and unseen devices. ZEST consists\nof i) a self-attention based network feature extractor, termed SANE, for\nextracting latent space representations of IoT traffic, ii) a generative model\nthat trains a decoder using latent features to generate pseudo data, and iii) a\nsupervised model that is trained on the generated pseudo data for classifying\ndevices. 
We carry out extensive experiments on real IoT traffic data; our\nexperiments demonstrate i) ZEST achieves significant improvement (in terms of\naccuracy) over the baselines; ii) ZEST is able to better extract meaningful\nrepresentations than LSTM which has been commonly used for modeling network\ntraffic.\n","authors":["Binghui Wu","Philipp Gysel","Dinil Mon Divakaran","Mohan Gurusamy"],"pdf_url":"https://arxiv.org/pdf/2310.08036v1.pdf","comment":"9 pages, 6 figures, 3 tables"},{"id":"http://arxiv.org/abs/2306.03410v2","updated":"2023-10-12T04:57:23Z","published":"2023-06-06T05:17:02Z","title":"Learning to Simulate Tree-Branch Dynamics for Manipulation","summary":" We propose to use a simulation driven inverse inference approach to model the\ndynamics of tree branches under manipulation. Learning branch dynamics and\ngaining the ability to manipulate deformable vegetation can help with\nocclusion-prone tasks, such as fruit picking in dense foliage, as well as\nmoving overhanging vines and branches for navigation in dense vegetation. The\nunderlying deformable tree geometry is encapsulated as coarse spring\nabstractions executed on parallel, non-differentiable simulators. The implicit\nstatistical model defined by the simulator, reference trajectories obtained by\nactively probing the ground truth, and the Bayesian formalism, together guide\nthe spring parameter posterior density estimation. Our non-parametric inference\nalgorithm, based on Stein Variational Gradient Descent, incorporates\nbiologically motivated assumptions into the inference process as neural network\ndriven learnt joint priors; moreover, it leverages the finite difference scheme\nfor gradient approximations. Real and simulated experiments confirm that our\nmodel can predict deformation trajectories, quantify the estimation\nuncertainty, and it can perform better when base-lined against other inference\nalgorithms, particularly from the Monte Carlo family. The model displays strong\nrobustness properties in the presence of heteroscedastic sensor noise;\nfurthermore, it can generalise to unseen grasp locations.\n","authors":["Jayadeep Jacob","Tirthankar Bandyopadhyay","Jason Williams","Paulo Borges","Fabio Ramos"],"pdf_url":"https://arxiv.org/pdf/2306.03410v2.pdf","comment":"8 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.08031v1","updated":"2023-10-12T04:37:15Z","published":"2023-10-12T04:37:15Z","title":"Local Graph Clustering with Noisy Labels","summary":" The growing interest in machine learning problems over graphs with additional\nnode information such as texts, images, or labels has popularized methods that\nrequire the costly operation of processing the entire graph. Yet, little effort\nhas been made to the development of fast local methods (i.e. without accessing\nthe entire graph) that extract useful information from such data. To that end,\nwe propose a study of local graph clustering using noisy node labels as a proxy\nfor additional node information. In this setting, nodes receive initial binary\nlabels based on cluster affiliation: 1 if they belong to the target cluster and\n0 otherwise. Subsequently, a fraction of these labels is flipped. We\ninvestigate the benefits of incorporating noisy labels for local graph\nclustering. By constructing a weighted graph with such labels, we study the\nperformance of graph diffusion-based local clustering method on both the\noriginal and the weighted graphs. 
From a theoretical perspective, we consider\nrecovering an unknown target cluster with a single seed node in a random graph\nwith independent noisy node labels. We provide sufficient conditions on the\nlabel noise under which, with high probability, using diffusion in the weighted\ngraph yields a more accurate recovery of the target cluster. This approach\nproves more effective than using the given labels alone or using diffusion in\nthe label-free original graph. Empirically, we show that reliable node labels\ncan be obtained with just a few samples from an attributed graph. Moreover,\nutilizing these labels via diffusion in the weighted graph leads to\nsignificantly better local clustering performance across several real-world\ndatasets, improving F1 scores by up to 13%.\n","authors":["Artur Back de Luca","Kimon Fountoulakis","Shenghao Yang"],"pdf_url":"https://arxiv.org/pdf/2310.08031v1.pdf","comment":"26 pages, 5 figures, 14 tables"},{"id":"http://arxiv.org/abs/2309.10691v2","updated":"2023-10-12T04:07:56Z","published":"2023-09-19T15:25:42Z","title":"MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language\n Feedback","summary":" To solve complex tasks, large language models (LLMs) often require multiple\nrounds of interactions with the user, sometimes assisted by external tools.\nHowever, current evaluation protocols often emphasize benchmark performance\nwith single-turn exchanges, neglecting the nuanced interactions among the user,\nLLMs, and external tools, while also underestimating the importance of natural\nlanguage feedback from users. These oversights contribute to discrepancies\nbetween research benchmark evaluations and real-world use cases. We introduce\nMINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn\ninteractions by (1) using tools and (2) leveraging natural language feedback.\nTo ensure reproducibility, we provide an evaluation framework where LLMs can\naccess tools by executing Python code and receive users' natural language\nfeedback simulated by GPT-4. We repurpose a diverse set of established\nevaluation datasets focusing on reasoning, coding, and decision-making and\ncarefully curate them into a compact subset for efficient evaluation. Our\nanalysis of 20 open- and closed-source LLMs offers intriguing findings. (a)\nLLMs generally benefit from tools and language feedback, with performance gains\n(absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural\nlanguage feedback. (b) Better single-turn performance does not guarantee better\nmulti-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised\ninstruction-finetuning (SIFT) and reinforcement learning from human feedback\n(RLHF) generally hurt multi-turn capabilities. 
We expect MINT can help measure\nprogress and incentivize research in improving LLMs' capabilities in multi-turn\ninteractions, especially for open-source communities where multi-turn human\nevaluation can be less accessible compared to commercial LLMs with a larger\nuser base.\n","authors":["Xingyao Wang","Zihan Wang","Jiateng Liu","Yangyi Chen","Lifan Yuan","Hao Peng","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2309.10691v2.pdf","comment":"Code is available on our project website:\n https://xingyaoww.github.io/mint-bench"},{"id":"http://arxiv.org/abs/2307.02484v5","updated":"2023-10-12T04:06:48Z","published":"2023-07-05T17:58:21Z","title":"Elastic Decision Transformer","summary":" This paper introduces Elastic Decision Transformer (EDT), a significant\nadvancement over the existing Decision Transformer (DT) and its variants.\nAlthough DT purports to generate an optimal trajectory, empirical evidence\nsuggests it struggles with trajectory stitching, a process involving the\ngeneration of an optimal or near-optimal trajectory from the best parts of a\nset of sub-optimal trajectories. The proposed EDT differentiates itself by\nfacilitating trajectory stitching during action inference at test time,\nachieved by adjusting the history length maintained in DT. Further, the EDT\noptimizes the trajectory by retaining a longer history when the previous\ntrajectory is optimal and a shorter one when it is sub-optimal, enabling it to\n\"stitch\" with a more optimal trajectory. Extensive experimentation demonstrates\nEDT's ability to bridge the performance gap between DT-based and Q\nLearning-based approaches. In particular, the EDT outperforms Q Learning-based\nmethods in a multi-task regime on the D4RL locomotion benchmark and Atari\ngames. Videos are available at: https://kristery.github.io/edt/\n","authors":["Yueh-Hua Wu","Xiaolong Wang","Masashi Hamaya"],"pdf_url":"https://arxiv.org/pdf/2307.02484v5.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2210.15889v4","updated":"2023-10-12T04:05:41Z","published":"2022-10-28T04:38:10Z","title":"Towards Data-and Knowledge-Driven Artificial Intelligence: A Survey on\n Neuro-Symbolic Computing","summary":" Neural-symbolic computing (NeSy), which pursues the integration of the\nsymbolic and statistical paradigms of cognition, has been an active research\narea of Artificial Intelligence (AI) for many years. As NeSy shows promise of\nreconciling the advantages of reasoning and interpretability of symbolic\nrepresentation and robust learning in neural networks, it may serve as a\ncatalyst for the next generation of AI. In the present paper, we provide a\nsystematic overview of the recent developments and important contributions of\nNeSy research. Firstly, we introduce study history of this area, covering early\nwork and foundations. We further discuss background concepts and identify key\ndriving factors behind the development of NeSy. Afterward, we categorize recent\nlandmark approaches along several main characteristics that underline this\nresearch paradigm, including neural-symbolic integration, knowledge\nrepresentation, knowledge embedding, and functionality. Next, we briefly\ndiscuss the successful application of modern NeSy approaches in several\ndomains. Then, we benchmark several NeSy methods on three representative\napplication tasks. Finally, we identify the open problems together with\npotential future research directions. 
This survey is expected to help new\nresearchers enter this rapidly evolving field and accelerate the progress\ntowards data-and knowledge-driven AI.\n","authors":["Wenguan Wang","Yi Yang","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2210.15889v4.pdf","comment":"Ongoing project"},{"id":"http://arxiv.org/abs/2302.02931v2","updated":"2023-10-12T03:47:00Z","published":"2023-02-06T17:07:16Z","title":"Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group\n Shifts","summary":" Training machine learning models robust to distribution shifts is critical\nfor real-world applications. Some robust training algorithms (e.g., Group DRO)\nspecialize to group shifts and require group information on all training\npoints. Other methods (e.g., CVaR DRO) that do not need group annotations can\nbe overly conservative, since they naively upweight high loss points which may\nform a contrived set that does not correspond to any meaningful group in the\nreal world (e.g., when the high loss points are randomly mislabeled training\npoints). In this work, we address limitations in prior approaches by assuming a\nmore nuanced form of group shift: conditioned on the label, we assume that the\ntrue group function (indicator over group) is simple. For example, we may\nexpect that group shifts occur along low bitrate features (e.g., image\nbackground, lighting). Thus, we aim to learn a model that maintains high\naccuracy on simple group functions realized by these low bitrate features, that\nneed not spend valuable model capacity achieving high accuracy on contrived\ngroups of examples. Based on this, we consider the two-player game formulation\nof DRO where the adversary's capacity is bitrate-constrained. Our resulting\npractical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group\ninformation on training samples yet matches the performance of Group DRO on\ndatasets that have training group annotations and that of CVaR DRO on\nlong-tailed distributions. Our theoretical analysis reveals that in some\nsettings BR-DRO objective can provably yield statistically efficient and less\nconservative solutions than unconstrained CVaR DRO.\n","authors":["Amrith Setlur","Don Dennis","Benjamin Eysenbach","Aditi Raghunathan","Chelsea Finn","Virginia Smith","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2302.02931v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08019v1","updated":"2023-10-12T03:41:32Z","published":"2023-10-12T03:41:32Z","title":"Robust 1-bit Compressed Sensing with Iterative Hard Thresholding","summary":" In 1-bit compressed sensing, the aim is to estimate a $k$-sparse unit vector\n$x\\in S^{n-1}$ within an $\\epsilon$ error (in $\\ell_2$) from minimal number of\nlinear measurements that are quantized to just their signs, i.e., from\nmeasurements of the form $y = \\mathrm{Sign}(\\langle a, x\\rangle).$ In this\npaper, we study a noisy version where a fraction of the measurements can be\nflipped, potentially by an adversary. In particular, we analyze the Binary\nIterative Hard Thresholding (BIHT) algorithm, a proximal gradient descent on a\nproperly defined loss function used for 1-bit compressed sensing, in this noisy\nsetting. It is known from recent results that, with\n$\\tilde{O}(\\frac{k}{\\epsilon})$ noiseless measurements, BIHT provides an\nestimate within $\\epsilon$ error. This result is optimal and universal, meaning\none set of measurements work for all sparse vectors. 
In this paper, we show\nthat BIHT also provides better results than all known methods for the noisy\nsetting. We show that when up to $\\tau$-fraction of the sign measurements are\nincorrect (adversarial error), with the same number of measurements as before,\nBIHT agnostically provides an estimate of $x$ within an\n$\\tilde{O}(\\epsilon+\\tau)$ error, maintaining the universality of measurements.\nThis establishes stability of iterative hard thresholding in the presence of\nmeasurement error. To obtain the result, we use the restricted approximate\ninvertibility of Gaussian matrices, as well as a tight analysis of the\nhigh-dimensional geometry of the adversarially corrupted measurements.\n","authors":["Namiko Matsumoto","Arya Mazumdar"],"pdf_url":"https://arxiv.org/pdf/2310.08019v1.pdf","comment":"Accepted to appear in ACM-SIAM Symposium on Discrete Algorithms\n (SODA) 2024"},{"id":"http://arxiv.org/abs/2308.03807v2","updated":"2023-10-12T03:36:17Z","published":"2023-08-06T15:47:03Z","title":"Nest-DGIL: Nesterov-optimized Deep Geometric Incremental Learning for CS\n Image Reconstruction","summary":" Proximal gradient-based optimization is one of the most common strategies to\nsolve inverse problem of images, and it is easy to implement. However, these\ntechniques often generate heavy artifacts in image reconstruction. One of the\nmost popular refinement methods is to fine-tune the regularization parameter to\nalleviate such artifacts, but it may not always be sufficient or applicable due\nto increased computational costs. In this work, we propose a deep geometric\nincremental learning framework based on the second Nesterov proximal gradient\noptimization. The proposed end-to-end network not only has the powerful\nlearning ability for high-/low-frequency image features, but also can\ntheoretically guarantee that geometric texture details will be reconstructed\nfrom preliminary linear reconstruction. Furthermore, it can avoid the risk of\nintermediate reconstruction results falling outside the geometric decomposition\ndomains and achieve fast convergence. Our reconstruction framework is\ndecomposed into four modules including general linear reconstruction, cascade\ngeometric incremental restoration, Nesterov acceleration, and post-processing.\nIn the image restoration step, a cascade geometric incremental learning module\nis designed to compensate for missing texture information from different\ngeometric spectral decomposition domains. Inspired by the overlap-tile\nstrategy, we also develop a post-processing module to remove the block effect\nin patch-wise-based natural image reconstruction. All parameters in the\nproposed model are learnable, an adaptive initialization technique of physical\nparameters is also employed to make model flexibility and ensure converging\nsmoothly. We compare the reconstruction performance of the proposed method with\nexisting state-of-the-art methods to demonstrate its superiority. 
Our source\ncodes are available at https://github.com/fanxiaohong/Nest-DGIL.\n","authors":["Xiaohong Fan","Yin Yang","Ke Chen","Yujie Feng","Jianping Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03807v2.pdf","comment":"15 pages, our source codes are available at\n https://github.com/fanxiaohong/Nest-DGIL"},{"id":"http://arxiv.org/abs/2310.07644v2","updated":"2023-10-12T03:32:32Z","published":"2023-10-11T16:40:57Z","title":"Rethinking the BERT-like Pretraining for DNA Sequences","summary":" With the success of large-scale pretraining in NLP, there is an increasing\ntrend of applying it to the domain of life sciences. In particular, pretraining\nmethods based on DNA sequences have garnered growing attention due to their\npotential to capture generic information about genes. However, existing\npretraining methods for DNA sequences largely rely on direct adoptions of BERT\npretraining from NLP, lacking a comprehensive understanding and a specifically\ntailored approach. To address this research gap, we first conducted a series of\nexploratory experiments and gained several insightful observations: 1) In the\nfine-tuning phase of downstream tasks, when using K-mer overlapping\ntokenization instead of K-mer non-overlapping tokenization, both overlapping\nand non-overlapping pretraining weights show consistent performance\nimprovement. 2) During the pre-training process, using K-mer overlapping\ntokenization quickly produces clear K-mer embeddings and reduces the loss to a\nvery low level, while using K-mer non-overlapping tokenization results in less\ndistinct embeddings and continuously decreases the loss. 3) Using overlapping\ntokenization causes the self-attention in the intermediate layers of\npre-trained models to tend to overly focus on certain tokens, reflecting that\nthese layers are not adequately optimized. In summary, overlapping tokenization\ncan benefit the fine-tuning of downstream tasks but leads to inadequate\npretraining with fast convergence. To unleash the pretraining potential, we\nintroduce a novel approach called RandomMask, which gradually increases the\ntask difficulty of BERT-like pretraining by continuously expanding its mask\nboundary, forcing the model to learn more knowledge. RandomMask is simple but\neffective, achieving top-tier performance on 26 of 28 datasets\nspanning 7 downstream tasks.\n","authors":["Chaoqi Liang","Weiqiang Bai","Lifeng Qiao","Yuchen Ren","Jianle Sun","Peng Ye","Hongliang Yan","Xinzhu Ma","Wangmeng Zuo","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.07644v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08015v1","updated":"2023-10-12T03:29:53Z","published":"2023-10-12T03:29:53Z","title":"Why Train More? Effective and Efficient Membership Inference via\n Memorization","summary":" Membership Inference Attacks (MIAs) aim to identify specific data samples\nwithin the private training dataset of machine learning models, leading to\nserious privacy violations and other sophisticated threats. Many practical\nblack-box MIAs require query access to the data distribution (the same\ndistribution where the private data is drawn) to train shadow models. By doing\nso, the adversary obtains models trained \"with\" or \"without\" samples drawn from\nthe distribution, and analyzes the characteristics of the samples under\nconsideration. The adversary is often required to train more than hundreds of\nshadow models to extract the signals needed for MIAs; this becomes the\ncomputational overhead of MIAs. 
In this paper, we propose that by strategically\nchoosing the samples, MI adversaries can maximize their attack success while\nminimizing the number of shadow models. First, our motivational experiments\nsuggest memorization as the key property explaining disparate sample\nvulnerability to MIAs. We formalize this through a theoretical bound that\nconnects MI advantage with memorization. Second, we show sample complexity\nbounds that connect the number of shadow models needed for MIAs with\nmemorization. Lastly, we confirm our theoretical arguments with comprehensive\nexperiments; by utilizing samples with high memorization scores, the adversary\ncan (a) significantly improve its efficacy regardless of the MIA used, and (b)\nreduce the number of shadow models by nearly two orders of magnitude compared\nto state-of-the-art approaches.\n","authors":["Jihye Choi","Shruti Tople","Varun Chandrasekaran","Somesh Jha"],"pdf_url":"https://arxiv.org/pdf/2310.08015v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08012v1","updated":"2023-10-12T03:28:14Z","published":"2023-10-12T03:28:14Z","title":"AutoFHE: Automated Adaption of CNNs for Efficient Evaluation over FHE","summary":" Secure inference of deep convolutional neural networks (CNNs) under RNS-CKKS\ninvolves polynomial approximation of unsupported non-linear activation\nfunctions. However, existing approaches have three main limitations: 1)\nInflexibility: The polynomial approximation and associated homomorphic\nevaluation architecture are customized manually for each CNN architecture and\ndo not generalize to other networks. 2) Suboptimal Approximation: Each\nactivation function is approximated instead of the function represented by the\nCNN. 3) Restricted Design: Either high-degree or low-degree polynomial\napproximations are used. The former retains high accuracy but slows down\ninference due to bootstrapping operations, while the latter accelerates\nciphertext inference but compromises accuracy. To address these limitations, we\npresent AutoFHE, which automatically adapts standard CNNs for secure inference\nunder RNS-CKKS. The key idea is to adopt layerwise mixed-degree polynomial\nactivation functions, which are optimized jointly with the homomorphic\nevaluation architecture in terms of the placement of bootstrapping operations.\nThe problem is modeled within a multi-objective optimization framework to\nmaximize accuracy and minimize the number of bootstrapping operations. AutoFHE\ncan be applied flexibly on any CNN architecture, and it provides diverse\nsolutions that span the trade-off between accuracy and latency. Experimental\nevaluation over RNS-CKKS encrypted CIFAR datasets shows that AutoFHE\naccelerates secure inference by $1.32\\times$ to $1.8\\times$ compared to methods\nemploying high-degree polynomials. It also improves accuracy by up to 2.56%\ncompared to methods using low-degree polynomials. Lastly, AutoFHE accelerates\ninference and improves accuracy by $103\\times$ and 3.46%, respectively,\ncompared to CNNs under TFHE.\n","authors":["Wei Ao","Vishnu Naresh Boddeti"],"pdf_url":"https://arxiv.org/pdf/2310.08012v1.pdf","comment":"USENIX Security Symposium 2024"},{"id":"http://arxiv.org/abs/2310.06763v2","updated":"2023-10-12T03:24:43Z","published":"2023-10-10T16:39:47Z","title":"FABind: Fast and Accurate Protein-Ligand Binding","summary":" Modeling the interaction between proteins and ligands and accurately\npredicting their binding structures is a critical yet challenging task in drug\ndiscovery. 
Recent advancements in deep learning have shown promise in\naddressing this challenge, with sampling-based and regression-based methods\nemerging as two prominent approaches. However, these methods have notable\nlimitations. Sampling-based methods often suffer from low efficiency due to the\nneed for generating multiple candidate structures for selection. On the other\nhand, regression-based methods offer fast predictions but may experience\ndecreased accuracy. Additionally, the variation in protein sizes often requires\nexternal modules for selecting suitable binding pockets, further impacting\nefficiency. In this work, we propose $\\mathbf{FABind}$, an end-to-end model\nthat combines pocket prediction and docking to achieve accurate and fast\nprotein-ligand binding. $\\mathbf{FABind}$ incorporates a unique ligand-informed\npocket prediction module, which is also leveraged for docking pose estimation.\nThe model further enhances the docking process by incrementally integrating the\npredicted pocket to optimize protein-ligand binding, reducing discrepancies\nbetween training and inference. Through extensive experiments on benchmark\ndatasets, our proposed $\\mathbf{FABind}$ demonstrates strong advantages in\nterms of effectiveness and efficiency compared to existing methods. Our code is\navailable at $\\href{https://github.com/QizhiPei/FABind}{Github}$.\n","authors":["Qizhi Pei","Kaiyuan Gao","Lijun Wu","Jinhua Zhu","Yingce Xia","Shufang Xie","Tao Qin","Kun He","Tie-Yan Liu","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.06763v2.pdf","comment":"Neural Information Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.06488v2","updated":"2023-10-12T03:23:40Z","published":"2023-10-10T09:57:17Z","title":"SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural\n Network","summary":" Spiking neural networks (SNNs) have demonstrated the capability to achieve\ncomparable performance to deep neural networks (DNNs) in both visual and\nlinguistic domains while offering the advantages of improved energy efficiency\nand adherence to biological plausibility. However, the extension of such\nsingle-modality SNNs into the realm of multimodal scenarios remains an\nunexplored territory. Drawing inspiration from the concept of contrastive\nlanguage-image pre-training (CLIP), we introduce a novel framework, named\nSpikeCLIP, to address the gap between two modalities within the context of\nspike-based computing through a two-step recipe involving ``Alignment\nPre-training + Dual-Loss Fine-tuning\". Extensive experiments demonstrate that\nSNNs achieve comparable results to their DNN counterparts while significantly\nreducing energy consumption across a variety of datasets commonly used for\nmultimodal model evaluation. Furthermore, SpikeCLIP maintains robust\nperformance in image classification tasks that involve class labels not\npredefined within specific categories.\n","authors":["Tianlong Li","Wenhao Liu","Changze Lv","Jianhan Xu","Cenyuan Zhang","Muling Wu","Xiaoqing Zheng","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.06488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.02285v2","updated":"2023-10-12T03:05:36Z","published":"2023-09-05T14:45:27Z","title":"PromptTTS 2: Describing and Generating Voices with Text Prompt","summary":" Speech conveys more information than text, as the same word can be uttered in\nvarious voices to convey diverse information. 
Compared to traditional\ntext-to-speech (TTS) methods relying on speech prompts (reference speech) for\nvoice variability, using text prompts (descriptions) is more user-friendly\nsince speech prompts can be hard to find or may not exist at all. TTS\napproaches based on the text prompt face two main challenges: 1) the\none-to-many problem, where not all details about voice variability can be\ndescribed in the text prompt, and 2) the limited availability of text prompt\ndatasets, where vendors and large cost of data labeling are required to write\ntext prompts for speech. In this work, we introduce PromptTTS 2 to address\nthese challenges with a variation network to provide variability information of\nvoice not captured by text prompts, and a prompt generation pipeline to utilize\nthe large language models (LLM) to compose high quality text prompts.\nSpecifically, the variation network predicts the representation extracted from\nthe reference speech (which contains full information about voice variability)\nbased on the text prompt representation. For the prompt generation pipeline, it\ngenerates text prompts for speech with a speech language understanding model to\nrecognize voice attributes (e.g., gender, speed) from speech and a large\nlanguage model to formulate text prompts based on the recognition results.\nExperiments on a large-scale (44K hours) speech dataset demonstrate that\ncompared to the previous works, PromptTTS 2 generates voices more consistent\nwith text prompts and supports the sampling of diverse voice variability,\nthereby offering users more choices on voice generation. Additionally, the\nprompt generation pipeline produces high-quality text prompts, eliminating the\nlarge labeling cost. The demo page of PromptTTS 2 is available online.\n","authors":["Yichong Leng","Zhifang Guo","Kai Shen","Xu Tan","Zeqian Ju","Yanqing Liu","Yufei Liu","Dongchao Yang","Leying Zhang","Kaitao Song","Lei He","Xiang-Yang Li","Sheng Zhao","Tao Qin","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2309.02285v2.pdf","comment":"Demo page: https://speechresearch.github.io/prompttts2"},{"id":"http://arxiv.org/abs/2309.09408v2","updated":"2023-10-12T23:55:38Z","published":"2023-09-18T00:22:59Z","title":"Guided Online Distillation: Promoting Safe Reinforcement Learning by\n Offline Demonstration","summary":" Safe Reinforcement Learning (RL) aims to find a policy that achieves high\nrewards while satisfying cost constraints. When learning from scratch, safe RL\nagents tend to be overly conservative, which impedes exploration and restrains\nthe overall performance. In many realistic tasks, e.g. autonomous driving,\nlarge-scale expert demonstration data are available. We argue that extracting\nexpert policy from offline data to guide online exploration is a promising\nsolution to mitigate the conserveness issue. Large-capacity models, e.g.\ndecision transformers (DT), have been proven to be competent in offline policy\nlearning. However, data collected in real-world scenarios rarely contain\ndangerous cases (e.g., collisions), which makes it prohibitive for the policies\nto learn safety concepts. Besides, these bulk policy networks cannot meet the\ncomputation speed requirements at inference time on real-world tasks such as\nautonomous driving. To this end, we propose Guided Online Distillation (GOLD),\nan offline-to-online safe RL framework. 
GOLD distills an offline DT policy into\na lightweight policy network through guided online safe RL training, which\noutperforms both the offline DT policy and online safe RL algorithms.\nExperiments in both benchmark safe RL tasks and real-world driving tasks based\non the Waymo Open Motion Dataset (WOMD) demonstrate that GOLD can successfully\ndistill lightweight policies and solve decision-making problems in challenging\nsafety-critical scenarios.\n","authors":["Jinning Li","Xinyi Liu","Banghua Zhu","Jiantao Jiao","Masayoshi Tomizuka","Chen Tang","Wei Zhan"],"pdf_url":"https://arxiv.org/pdf/2309.09408v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.08141v2","updated":"2023-10-12T23:50:05Z","published":"2023-01-11T22:47:12Z","title":"Self-supervised Learning for Segmentation and Quantification of Dopamine\n Neurons in Parkinson's Disease","summary":" Parkinson's Disease (PD) is the second most common neurodegenerative disease\nin humans. PD is characterized by the gradual loss of dopaminergic neurons in\nthe Substantia Nigra (SN). Counting the number of dopaminergic neurons in the\nSN is one of the most important indexes in evaluating drug efficacy in PD\nanimal models. Currently, analyzing and quantifying dopaminergic neurons is\nconducted manually by experts through analysis of digital pathology images\nwhich is laborious, time-consuming, and highly subjective. As such, a reliable\nand unbiased automated system is demanded for the quantification of\ndopaminergic neurons in digital pathology images. Recent years have seen a\nsurge in adopting deep learning solutions in medical image processing. However,\ndeveloping high-performing deep learning models hinges on the availability of\nlarge-scale, high-quality annotated data, which can be expensive to acquire,\nespecially in applications like digital pathology image analysis. To this end,\nwe propose an end-to-end deep learning framework based on self-supervised\nlearning for the segmentation and quantification of dopaminergic neurons in PD\nanimal models. To the best of our knowledge, this is the first deep learning\nmodel that detects the cell body of dopaminergic neurons, counts the number of\ndopaminergic neurons, and provides characteristics of individual dopaminergic\nneurons as a numerical output. Extensive experiments demonstrate the\neffectiveness of our model in quantifying neurons with high precision, which\ncan provide a faster turnaround for drug efficacy studies, better understanding\nof dopaminergic neuronal health status, and unbiased results in PD pre-clinical\nresearch. As part of our contributions, we also provide the first publicly\navailable dataset of histology digital images along with expert annotations for\nthe segmentation of TH-positive DA neuronal soma.\n","authors":["Fatemeh Haghighi","Soumitra Ghosh","Hai Ngu","Sarah Chu","Han Lin","Mohsen Hejrati","Baris Bingol","Somaye Hashemifar"],"pdf_url":"https://arxiv.org/pdf/2301.08141v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08775v1","updated":"2023-10-12T23:47:22Z","published":"2023-10-12T23:47:22Z","title":"When Machine Learning Models Leak: An Exploration of Synthetic Training\n Data","summary":" We investigate an attack on a machine learning model that predicts whether a\nperson or household will relocate in the next two years, i.e., a\npropensity-to-move classifier. 
The attack assumes that the attacker can query\nthe model to obtain predictions and that the marginal distribution of the data\non which the model was trained is publicly available. The attack also assumes\nthat the attacker has obtained the values of non-sensitive attributes for a\ncertain number of target individuals. The objective of the attack is to infer\nthe values of sensitive attributes for these target individuals. We explore how\nreplacing the original data with synthetic data when training the model impacts\nhow successfully the attacker can infer sensitive attributes.\\footnote{Original\npaper published at PSD 2022. The paper was subsequently updated.}\n","authors":["Manel Slokom","Peter-Paul de Wolf","Martha Larson"],"pdf_url":"https://arxiv.org/pdf/2310.08775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08774v1","updated":"2023-10-12T23:46:08Z","published":"2023-10-12T23:46:08Z","title":"PhyloGFN: Phylogenetic inference with generative flow networks","summary":" Phylogenetics is a branch of computational biology that studies the\nevolutionary relationships among biological entities. Its long history and\nnumerous applications notwithstanding, inference of phylogenetic trees from\nsequence data remains challenging: the high complexity of tree space poses a\nsignificant obstacle for the current combinatorial and probabilistic\ntechniques. In this paper, we adopt the framework of generative flow networks\n(GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and\nBayesian phylogenetic inference. Because GFlowNets are well-suited for sampling\ncomplex combinatorial structures, they are a natural choice for exploring and\nsampling from the multimodal posterior distribution over tree topologies and\nevolutionary distances. We demonstrate that our amortized posterior sampler,\nPhyloGFN, produces diverse and high-quality evolutionary hypotheses on real\nbenchmark datasets. PhyloGFN is competitive with prior works in marginal\nlikelihood estimation and achieves a closer fit to the target distribution than\nstate-of-the-art variational inference methods.\n","authors":["Mingyang Zhou","Zichao Yan","Elliot Layne","Nikolay Malkin","Dinghuai Zhang","Moksh Jain","Mathieu Blanchette","Yoshua Bengio"],"pdf_url":"https://arxiv.org/pdf/2310.08774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.00489v2","updated":"2023-10-12T23:37:01Z","published":"2023-09-30T20:59:42Z","title":"Dynamic DAG Discovery for Interpretable Imitation Learning","summary":" Imitation learning, which learns agent policy by mimicking expert\ndemonstration, has shown promising results in many applications such as medical\ntreatment regimes and self-driving vehicles. However, it remains a difficult\ntask to interpret control policies learned by the agent. Difficulties mainly\ncome from two aspects: 1) agents in imitation learning are usually implemented\nas deep neural networks, which are black-box models and lack interpretability;\n2) the latent causal mechanism behind agents' decisions may vary along the\ntrajectory, rather than staying static throughout time steps. To increase\ntransparency and offer better interpretability of the neural agent, we propose\nto expose its captured knowledge in the form of a directed acyclic causal\ngraph, with nodes being action and state variables and edges denoting the\ncausal relations behind predictions. Furthermore, we design this causal\ndiscovery process to be state-dependent, enabling it to model the dynamics in\nlatent causal graphs. 
Concretely, we conduct causal discovery from the\nperspective of Granger causality and propose a self-explainable imitation\nlearning framework, {\\method}. The proposed framework is composed of three\nparts: a dynamic causal discovery module, a causality encoding module, and a\nprediction module, and is trained in an end-to-end manner. After the model is\nlearned, we can obtain causal relations among states and action variables\nbehind its decisions, exposing policies learned by it. Experimental results on\nboth synthetic and real-world datasets demonstrate the effectiveness of the\nproposed {\\method} in learning the dynamic causal graphs for understanding the\ndecision-making of imitation learning meanwhile maintaining high prediction\naccuracy.\n","authors":["ianxiang Zhao","Wenchao Yu","Suhang Wang","Lu Wang","Xiang Zhang","Yuncong Chen","Yanchi Liu","Wei Cheng","Haifeng Chen"],"pdf_url":"https://arxiv.org/pdf/2310.00489v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08767v1","updated":"2023-10-12T23:26:44Z","published":"2023-10-12T23:26:44Z","title":"Modeling Fission Gas Release at the Mesoscale using Multiscale DenseNet\n Regression with Attention Mechanism and Inception Blocks","summary":" Mesoscale simulations of fission gas release (FGR) in nuclear fuel provide a\npowerful tool for understanding how microstructure evolution impacts FGR, but\nthey are computationally intensive. In this study, we present an alternate,\ndata-driven approach, using deep learning to predict instantaneous FGR flux\nfrom 2D nuclear fuel microstructure images. Four convolutional neural network\n(CNN) architectures with multiscale regression are trained and evaluated on\nsimulated FGR data generated using a hybrid phase field/cluster dynamics model.\nAll four networks show high predictive power, with $R^{2}$ values above 98%.\nThe best performing network combine a Convolutional Block Attention Module\n(CBAM) and InceptionNet mechanisms to provide superior accuracy (mean absolute\npercentage error of 4.4%), training stability, and robustness on very low\ninstantaneous FGR flux values.\n","authors":["Peter Toma","Md Ali Muntaha","Joel B. Harley","Michael R. Tonks"],"pdf_url":"https://arxiv.org/pdf/2310.08767v1.pdf","comment":"Submitted at Journal of Nuclear Materials, 20 pages, 10 figures, 3\n tables"},{"id":"http://arxiv.org/abs/2006.10189v4","updated":"2023-10-12T23:18:22Z","published":"2020-06-17T22:45:14Z","title":"Revisiting minimum description length complexity in overparameterized\n models","summary":" Complexity is a fundamental concept underlying statistical learning theory\nthat aims to inform generalization performance. Parameter count, while\nsuccessful in low-dimensional settings, is not well-justified for\noverparameterized settings when the number of parameters is more than the\nnumber of training samples. We revisit complexity measures based on Rissanen's\nprinciple of minimum description length (MDL) and define a novel MDL-based\ncomplexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP\nis defined via an optimality criterion over the encodings induced by a good\nRidge estimator class. We provide an extensive theoretical characterization of\nMDL-COMP for linear models and kernel methods and show that it is not just a\nfunction of parameter count, but rather a function of the singular values of\nthe design or the kernel matrix and the signal-to-noise ratio. For a linear\nmodel with $n$ observations, $d$ parameters, and i.i.d. 
Gaussian predictors,\nMDL-COMP scales linearly with $d$ when $d < n$, but grows much more slowly when\n$d > n$. For kernel methods, we show that MDL-COMP\ninforms minimax in-sample error, and can decrease as the dimensionality of the\ninput increases. We also prove that MDL-COMP upper bounds the in-sample mean\nsquared error (MSE). Via an array of simulations and real-data experiments, we\nshow that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for\noptimizing test MSE with ridge regression in limited data settings, sometimes\nimproving upon cross-validation and (always) saving computational costs.\nFinally, our findings also suggest that the recently observed double descent\nphenomenon in overparameterized models might be a consequence of the choice of\nnon-ideal estimators.\n","authors":["Raaz Dwivedi","Chandan Singh","Bin Yu","Martin J. Wainwright"],"pdf_url":"https://arxiv.org/pdf/2006.10189v4.pdf","comment":"First two authors contributed equally"},{"id":"http://arxiv.org/abs/2310.08764v1","updated":"2023-10-12T23:17:56Z","published":"2023-10-12T23:17:56Z","title":"Calibrating Likelihoods towards Consistency in Summarization Models","summary":" Despite the recent advances in abstractive text summarization, current\nsummarization models still suffer from generating factually inconsistent\nsummaries, reducing their utility for real-world applications. We argue that the\nmain reason for such behavior is that the summarization models trained with\na maximum likelihood objective assign high probability to plausible sequences\ngiven the context, but they often do not accurately rank sequences by their\nconsistency. In this work, we solve this problem by calibrating the likelihood\nof model generated sequences to better align with a consistency metric measured\nby natural language inference (NLI) models. The human evaluation study and\nautomatic metrics show that the calibrated models generate more consistent and\nhigher-quality summaries. We also show that the models trained using our method\nreturn probabilities that are better aligned with the NLI scores, which\nsignificantly increases the reliability of summarization models.\n","authors":["Polina Zablotskaia","Misha Khalman","Rishabh Joshi","Livio Baldini Soares","Shoshana Jakobovits","Joshua Maynez","Shashi Narayan"],"pdf_url":"https://arxiv.org/pdf/2310.08764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08762v1","updated":"2023-10-12T23:06:52Z","published":"2023-10-12T23:06:52Z","title":"Stabilizing Subject Transfer in EEG Classification with Divergence\n Estimation","summary":" Classification models for electroencephalogram (EEG) data show a large\ndecrease in performance when evaluated on unseen test subjects. We reduce this\nperformance decrease using new regularization techniques during model training.\nWe propose several graphical models to describe an EEG classification task.\nFrom each model, we identify statistical relationships that should hold true in\nan idealized training scenario (with infinite data and a globally-optimal\nmodel) but that may not hold in practice. We design regularization penalties to\nenforce these relationships in two stages. First, we identify suitable proxy\nquantities (divergences such as Mutual Information and Wasserstein-1) that can\nbe used to measure statistical independence and dependence relationships.\nSecond, we provide algorithms to efficiently estimate these quantities during\ntraining using secondary neural network models. 
We conduct extensive\ncomputational experiments using a large benchmark EEG dataset, comparing our\nproposed techniques with a baseline method that uses an adversarial classifier.\nWe find our proposed methods significantly increase balanced accuracy on test\nsubjects and decrease overfitting. The proposed methods exhibit a larger\nbenefit over a greater range of hyperparameters than the baseline method, with\nonly a small computational cost at training time. These benefits are largest\nwhen used for a fixed training period, though there is still a significant\nbenefit for a subset of hyperparameters when our techniques are used in\nconjunction with early stopping regularization.\n","authors":["Niklas Smedemark-Margulies","Ye Wang","Toshiaki Koike-Akino","Jing Liu","Kieran Parsons","Yunus Bicer","Deniz Erdogmus"],"pdf_url":"https://arxiv.org/pdf/2310.08762v1.pdf","comment":"16 pages, 5 figures"},{"id":"http://arxiv.org/abs/2303.06827v2","updated":"2023-10-12T23:04:25Z","published":"2023-03-13T03:00:03Z","title":"Kernel Density Bayesian Inverse Reinforcement Learning","summary":" Inverse reinforcement learning~(IRL) is a powerful framework to infer an\nagent's reward function by observing its behavior, but IRL algorithms that\nlearn point estimates of the reward function can be misleading because there\nmay be several functions that describe an agent's behavior equally well. A\nBayesian approach to IRL models a distribution over candidate reward functions,\nalleviating the shortcomings of learning a point estimate. However, several\nBayesian IRL algorithms use a $Q$-value function in place of the likelihood\nfunction. The resulting posterior is computationally intensive to calculate,\nhas few theoretical guarantees, and the $Q$-value function is often a poor\napproximation for the likelihood. We introduce kernel density Bayesian IRL\n(KD-BIRL), which uses conditional kernel density estimation to directly\napproximate the likelihood, providing an efficient framework that, with a\nmodified reward function parameterization, is applicable to environments with\ncomplex and infinite state spaces. We demonstrate KD-BIRL's benefits through a\nseries of experiments in Gridworld environments and a simulated sepsis\ntreatment task.\n","authors":["Aishwarya Mandyam","Didong Li","Diana Cai","Andrew Jones","Barbara E. Engelhardt"],"pdf_url":"https://arxiv.org/pdf/2303.06827v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08759v1","updated":"2023-10-12T22:56:53Z","published":"2023-10-12T22:56:53Z","title":"Question Answering for Electronic Health Records: A Scoping Review of\n datasets and models","summary":" Question Answering (QA) systems on patient-related data can assist both\nclinicians and patients. They can, for example, assist clinicians in\ndecision-making and enable patients to have a better understanding of their\nmedical history. Significant amounts of patient data are stored in Electronic\nHealth Records (EHRs), making EHR QA an important research area. In EHR QA, the\nanswer is obtained from the medical record of the patient. Because of the\ndifferences in data format and modality, this differs greatly from other\nmedical QA tasks that employ medical websites or scientific papers to retrieve\nanswers, making it critical to research EHR question answering. This study\naimed to provide a methodological review of existing works on QA over EHRs. 
We\nsearched for articles from January 1st, 2005 to September 30th, 2023 in four\ndigital sources including Google Scholar, ACL Anthology, ACM Digital Library,\nand PubMed to collect relevant publications on EHR QA. 4111 papers were\nidentified for our study, and after screening based on our inclusion criteria,\nwe obtained a total of 47 papers for further study. Out of the 47 papers, 25\npapers were about EHR QA datasets, and 37 papers were about EHR QA models. It\nwas observed that QA on EHRs is relatively new and unexplored. Most of the\nworks are fairly recent. Also, it was observed that emrQA is by far the most\npopular EHR QA dataset, both in terms of citations and usage in other papers.\nFurthermore, we identified the different models used in EHR QA along with the\nevaluation metrics used for these models.\n","authors":["Jayetri Bardhan","Kirk Roberts","Daisy Zhe Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08759v1.pdf","comment":"5 tables, 6 figures"},{"id":"http://arxiv.org/abs/2303.06171v4","updated":"2023-10-12T22:55:43Z","published":"2023-03-10T19:14:20Z","title":"DP-Fast MH: Private, Fast, and Accurate Metropolis-Hastings for\n Large-Scale Bayesian Inference","summary":" Bayesian inference provides a principled framework for learning from complex\ndata and reasoning under uncertainty. It has been widely applied in machine\nlearning tasks such as medical diagnosis, drug design, and policymaking. In\nthese common applications, data can be highly sensitive. Differential privacy\n(DP) offers data analysis tools with powerful worst-case privacy guarantees and\nhas been developed as the leading approach in privacy-preserving data analysis.\nIn this paper, we study Metropolis-Hastings (MH), one of the most fundamental\nMCMC methods, for large-scale Bayesian inference under differential privacy.\nWhile most existing private MCMC algorithms sacrifice accuracy and efficiency\nto obtain privacy, we provide the first exact and fast DP MH algorithm, using\nonly a minibatch of data in most iterations. We further reveal, for the first\ntime, a three-way trade-off among privacy, scalability (i.e. the batch size),\nand efficiency (i.e. the convergence rate), theoretically characterizing how\nprivacy affects the utility and computational cost in Bayesian inference. We\nempirically demonstrate the effectiveness and efficiency of our algorithm in\nvarious experiments.\n","authors":["Wanrong Zhang","Ruqi Zhang"],"pdf_url":"https://arxiv.org/pdf/2303.06171v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08757v1","updated":"2023-10-12T22:52:29Z","published":"2023-10-12T22:52:29Z","title":"Detection and prediction of clopidogrel treatment failures using\n longitudinal structured electronic health records","summary":" We propose machine learning algorithms to automatically detect and predict\nclopidogrel treatment failure using longitudinal structured electronic health\nrecords (EHR). By drawing analogies between natural language and structured\nEHR, we introduce various machine learning algorithms used in natural language\nprocessing (NLP) applications to build models for treatment failure detection\nand prediction. In this regard, we generated a cohort of patients with\nclopidogrel prescriptions from UK Biobank and annotated if the patients had\ntreatment failure events within one year of the first clopidogrel prescription;\nout of 502,527 patients, 1,824 patients were identified as treatment failure\ncases, and 6,859 patients were considered as control cases. 
From the dataset,\nwe gathered diagnoses, prescriptions, and procedure records together per\npatient and organized them into visits with the same date to build models. The\nmodels were built for two different tasks, i.e., detection and prediction, and\nthe experimental results showed that time series models outperform bag-of-words\napproaches in both tasks. In particular, a Transformer-based model, namely\nBERT, could reach 0.928 AUC in detection tasks and 0.729 AUC in prediction\ntasks. BERT also showed competence over other time series models when there is\nnot enough training data, because it leverages the pre-training procedure using\nlarge unlabeled data.\n","authors":["Samuel Kim","In Gu Sean Lee","Mijeong Irene Ban","Jane Chiang"],"pdf_url":"https://arxiv.org/pdf/2310.08757v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14941v2","updated":"2023-10-12T22:49:49Z","published":"2023-06-26T17:54:24Z","title":"SIMMF: Semantics-aware Interactive Multiagent Motion Forecasting for\n Autonomous Vehicle Driving","summary":" Autonomous vehicles require motion forecasting of their surrounding\nmultiagents (pedestrians and vehicles) to make optimal decisions for\nnavigation. The existing methods focus on techniques to utilize the positions\nand velocities of these agents and fail to capture semantic information from\nthe scene. Moreover, to mitigate the increase in computational complexity\nassociated with the number of agents in the scene, some works leverage\nEuclidean distance to prune far-away agents. However, distance-based metric\nalone is insufficient to select relevant agents and accurately perform their\npredictions. To resolve these issues, we propose the Semantics-aware\nInteractive Multiagent Motion Forecasting (SIMMF) method to capture semantics\nalong with spatial information and optimally select relevant agents for motion\nprediction. Specifically, we achieve this by implementing a semantic-aware\nselection of relevant agents from the scene and passing them through an\nattention mechanism to extract global encodings. These encodings along with\nagents' local information, are passed through an encoder to obtain\ntime-dependent latent variables for a motion policy predicting the future\ntrajectories. Our results show that the proposed approach outperforms\nstate-of-the-art baselines and provides more accurate and scene-consistent\npredictions.\n","authors":["Vidyaa Krishnan Nivash","Ahmed H. Qureshi"],"pdf_url":"https://arxiv.org/pdf/2306.14941v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08754v1","updated":"2023-10-12T22:44:19Z","published":"2023-10-12T22:44:19Z","title":"Tokenizer Choice For LLM Training: Negligible or Crucial?","summary":" The recent success of LLMs has been predominantly driven by curating the\ntraining dataset composition, scaling of model architectures and dataset sizes\nand advancements in pretraining objectives, leaving tokenizer influence as a\nblind spot. Shedding light on this underexplored area, we conduct a\ncomprehensive study on the influence of tokenizer choice on LLM downstream\nperformance by training 24 mono- and multilingual LLMs at a 2.6B parameter\nscale, ablating different tokenizer algorithms and parameterizations. Our\nstudies highlight that the tokenizer choice can significantly impact the\nmodel's downstream performance, training and inference costs. 
In particular, we\nfind that the common tokenizer evaluation metrics fertility and parity are not\nalways predictive of model downstream performance, rendering these metrics a\nquestionable choice for tokenizer evaluation. Furthermore, we show that\nmultilingual tokenizers trained on the five most frequent European languages\nrequire vocabulary size increases of factor three in comparison to English.\nWhile English-only tokenizers have been applied to the training of\nmulti-lingual LLMs in the past, we find that this approach results in a severe\ndownstream performance degradation and additional training costs of up to 68%,\ndue to an inefficient tokenization vocabulary.\n","authors":["Mehdi Ali","Michael Fromm","Klaudia Thellmann","Richard Rutmann","Max Lübbering","Johannes Leveling","Katrin Klug","Jan Ebert","Niclas Doll","Jasper Schulze Buschhoff","Charvi Jain","Alexander Arno Weber","Lena Jurkschat","Hammam Abdelwahab","Chelsea John","Pedro Ortiz Suarez","Malte Ostendorff","Samuel Weinbach","Rafet Sifa","Stefan Kesselheim","Nicolas Flores-Herr"],"pdf_url":"https://arxiv.org/pdf/2310.08754v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08751v1","updated":"2023-10-12T22:32:00Z","published":"2023-10-12T22:32:00Z","title":"Constrained Bayesian Optimization with Adaptive Active Learning of\n Unknown Constraints","summary":" Optimizing objectives under constraints, where both the objectives and\nconstraints are black box functions, is a common scenario in real-world\napplications such as scientific experimental design, design of medical\ntherapies, and industrial process optimization. One popular approach to\nhandling these complex scenarios is Bayesian Optimization (BO). In terms of\ntheoretical behavior, BO is relatively well understood in the unconstrained\nsetting, where its principles have been well explored and validated. However,\nwhen it comes to constrained Bayesian optimization (CBO), the existing\nframework often relies on heuristics or approximations without the same level\nof theoretical guarantees.\n In this paper, we delve into the theoretical and practical aspects of\nconstrained Bayesian optimization, where the objective and constraints can be\nindependently evaluated and are subject to noise. By recognizing that both the\nobjective and constraints can help identify high-confidence regions of interest\n(ROI), we propose an efficient CBO framework that intersects the ROIs\nidentified from each aspect to determine the general ROI. The ROI, coupled with\na novel acquisition function that adaptively balances the optimization of the\nobjective and the identification of feasible regions, enables us to derive\nrigorous theoretical justifications for its performance. We showcase the\nefficiency and robustness of our proposed CBO framework through empirical\nevidence and discuss the fundamental challenge of deriving practical regret\nbounds for CBO algorithms.\n","authors":["Fengxue Zhang","Zejie Zhu","Yuxin Chen"],"pdf_url":"https://arxiv.org/pdf/2310.08751v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08750v1","updated":"2023-10-12T22:30:15Z","published":"2023-10-12T22:30:15Z","title":"Search-Adaptor: Text Embedding Customization for Information Retrieval","summary":" Text embeddings extracted by pre-trained Large Language Models (LLMs) have\nsignificant potential to improve information retrieval and search. 
Beyond the\nzero-shot setup in which they are being conventionally used, being able to take\nadvantage of the information from the relevant query-corpus paired data has the\npower to further boost the LLM capabilities. In this paper, we propose a novel\nmethod, Search-Adaptor, for customizing LLMs for information retrieval in an\nefficient and robust way. Search-Adaptor modifies the original text embedding\ngenerated by pre-trained LLMs, and can be integrated with any LLM, including\nthose only available via APIs. On multiple real-world English and multilingual\nretrieval datasets, we show consistent and significant performance benefits for\nSearch-Adaptor -- e.g., more than 5.2% improvements over the Google Embedding\nAPIs in nDCG@10 averaged over 13 BEIR datasets.\n","authors":["Jinsung Yoon","Sercan O Arik","Yanfei Chen","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2310.08750v1.pdf","comment":"9 pages, 2 figures"},{"id":"http://arxiv.org/abs/2310.08748v1","updated":"2023-10-12T22:28:53Z","published":"2023-10-12T22:28:53Z","title":"Evolutionary Dynamic Optimization and Machine Learning","summary":" Evolutionary Computation (EC) has emerged as a powerful field of Artificial\nIntelligence, inspired by nature's mechanisms of gradual development. However,\nEC approaches often face challenges such as stagnation, diversity loss,\ncomputational complexity, population initialization, and premature convergence.\nTo overcome these limitations, researchers have integrated learning algorithms\nwith evolutionary techniques. This integration harnesses the valuable data\ngenerated by EC algorithms during iterative searches, providing insights into\nthe search space and population dynamics. Similarly, the relationship between\nevolutionary algorithms and Machine Learning (ML) is reciprocal, as EC methods\noffer exceptional opportunities for optimizing complex ML tasks characterized\nby noisy, inaccurate, and dynamic objective functions. These hybrid techniques,\nknown as Evolutionary Machine Learning (EML), have been applied at various\nstages of the ML process. EC techniques play a vital role in tasks such as data\nbalancing, feature selection, and model training optimization. Moreover, ML\ntasks often require dynamic optimization, for which Evolutionary Dynamic\nOptimization (EDO) is valuable. This paper presents the first comprehensive\nexploration of reciprocal integration between EDO and ML. The study aims to\nstimulate interest in the evolutionary learning community and inspire\ninnovative contributions in this domain.\n","authors":["Abdennour Boulesnane"],"pdf_url":"https://arxiv.org/pdf/2310.08748v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.06366v4","updated":"2023-10-12T22:25:43Z","published":"2022-10-12T16:18:25Z","title":"A Generalist Framework for Panoptic Segmentation of Images and Videos","summary":" Panoptic segmentation assigns semantic and instance ID labels to every pixel\nof an image. As permutations of instance IDs are also valid solutions, the task\nrequires learning of high-dimensional one-to-many mapping. As a result,\nstate-of-the-art approaches use customized architectures and task-specific loss\nfunctions. We formulate panoptic segmentation as a discrete data generation\nproblem, without relying on inductive bias of the task. A diffusion model is\nproposed to model panoptic masks, with a simple architecture and generic loss\nfunction. 
By simply adding past predictions as a conditioning signal, our\nmethod is capable of modeling video (in a streaming setting) and thereby learns\nto track object instances automatically. With extensive experiments, we\ndemonstrate that our simple approach can perform competitively to\nstate-of-the-art specialist methods in similar settings.\n","authors":["Ting Chen","Lala Li","Saurabh Saxena","Geoffrey Hinton","David J. Fleet"],"pdf_url":"https://arxiv.org/pdf/2210.06366v4.pdf","comment":"ICCV'23. Code at https://github.com/google-research/pix2seq"},{"id":"http://arxiv.org/abs/2310.08746v1","updated":"2023-10-12T22:19:36Z","published":"2023-10-12T22:19:36Z","title":"Robustness to Multi-Modal Environment Uncertainty in MARL using\n Curriculum Learning","summary":" Multi-agent reinforcement learning (MARL) plays a pivotal role in tackling\nreal-world challenges. However, the seamless transition of trained policies\nfrom simulations to real-world requires it to be robust to various\nenvironmental uncertainties. Existing works focus on finding Nash Equilibrium\nor the optimal policy under uncertainty in one environment variable (i.e.\naction, state or reward). This is because a multi-agent system itself is highly\ncomplex and unstationary. However, in real-world situation uncertainty can\noccur in multiple environment variables simultaneously. This work is the first\nto formulate the generalised problem of robustness to multi-modal environment\nuncertainty in MARL. To this end, we propose a general robust training approach\nfor multi-modal uncertainty based on curriculum learning techniques. We handle\ntwo distinct environmental uncertainty simultaneously and present extensive\nresults across both cooperative and competitive MARL environments,\ndemonstrating that our approach achieves state-of-the-art levels of robustness.\n","authors":["Aakriti Agrawal","Rohith Aralikatti","Yanchao Sun","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08746v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08744v1","updated":"2023-10-12T22:12:28Z","published":"2023-10-12T22:12:28Z","title":"Circuit Component Reuse Across Tasks in Transformer Language Models","summary":" Recent work in mechanistic interpretability has shown that behaviors in\nlanguage models can be successfully reverse-engineered through circuit\nanalysis. A common criticism, however, is that each circuit is task-specific,\nand thus such analysis cannot contribute to understanding the models at a\nhigher level. In this work, we present evidence that insights (both low-level\nfindings about specific heads and higher-level findings about general\nalgorithms) can indeed generalize across tasks. Specifically, we study the\ncircuit discovered in Wang et al. (2022) for the Indirect Object Identification\n(IOI) task and 1.) show that it reproduces on a larger GPT2 model, and 2.) that\nit is mostly reused to solve a seemingly different task: Colored Objects\n(Ippolito & Callison-Burch, 2023). We provide evidence that the process\nunderlying both tasks is functionally very similar, and contains about a 78%\noverlap in in-circuit attention heads. We further present a proof-of-concept\nintervention experiment, in which we adjust four attention heads in middle\nlayers in order to 'repair' the Colored Objects circuit and make it behave like\nthe IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the\nColored Objects task and explain most sources of error. 
The intervention\naffects downstream attention heads in specific ways predicted by their\ninteractions in the IOI circuit, indicating that this subcircuit behavior is\ninvariant to the different task inputs. Overall, our results provide evidence\nthat it may yet be possible to explain large language models' behavior in terms\nof a relatively small number of interpretable task-general algorithmic building\nblocks and computational components.\n","authors":["Jack Merullo","Carsten Eickhoff","Ellie Pavlick"],"pdf_url":"https://arxiv.org/pdf/2310.08744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08743v1","updated":"2023-10-12T22:09:53Z","published":"2023-10-12T22:09:53Z","title":"Development and Validation of a Deep Learning-Based Microsatellite\n Instability Predictor from Prostate Cancer Whole-Slide Images","summary":" Microsatellite instability-high (MSI-H) is a tumor agnostic biomarker for\nimmune checkpoint inhibitor therapy. However, MSI status is not routinely\ntested in prostate cancer, in part due to low prevalence and assay cost. As\nsuch, prediction of MSI status from hematoxylin and eosin (H&E) stained\nwhole-slide images (WSIs) could identify prostate cancer patients most likely\nto benefit from confirmatory testing and becoming eligible for immunotherapy.\nProstate biopsies and surgical resections from de-identified records of\nconsecutive prostate cancer patients referred to our institution were analyzed.\nTheir MSI status was determined by next generation sequencing. Patients before\na cutoff date were split into an algorithm development set (n=4015, MSI-H 1.8%)\nand a paired validation set (n=173, MSI-H 19.7%) that consisted of two serial\nsections from each sample, one stained and scanned internally and the other at\nan external site. Patients after the cutoff date formed the temporal validation\nset (n=1350, MSI-H 2.3%). Attention-based multiple instance learning models\nwere trained to predict MSI-H from H&E WSIs. The MSI-H predictor achieved area\nunder the receiver operating characteristic curve values of 0.78 (95% CI\n[0.69-0.86]), 0.72 (95% CI [0.63-0.81]), and 0.72 (95% CI [0.62-0.82]) on the\ninternally prepared, externally prepared, and temporal validation sets,\nrespectively. While MSI-H status is significantly correlated with Gleason\nscore, the model remained predictive within each Gleason score subgroup. In\nsummary, we developed and validated an AI-based MSI-H diagnostic model on a\nlarge real-world cohort of routine H&E slides, which effectively generalized to\nexternally stained and scanned samples and a temporally independent validation\ncohort. This algorithm has the potential to direct prostate cancer patients\ntoward immunotherapy and to identify MSI-H cases secondary to Lynch syndrome.\n","authors":["Qiyuan Hu","Abbas A. Rizvi","Geoffery Schau","Kshitij Ingale","Yoni Muller","Rachel Baits","Sebastian Pretzer","Aïcha BenTaieb","Abigail Gordhamer","Roberto Nussenzveig","Adam Cole","Matthew O. Leavitt","Rohan P. Joshi","Nike Beaubier","Martin C. Stumpe","Kunal Nagpal"],"pdf_url":"https://arxiv.org/pdf/2310.08743v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14600v2","updated":"2023-10-12T21:58:06Z","published":"2023-05-24T00:46:02Z","title":"Learning Semantic Role Labeling from Compatible Label Sequences","summary":" Semantic role labeling (SRL) has multiple disjoint label sets, e.g., VerbNet\nand PropBank. Creating these datasets is challenging, therefore a natural\nquestion is how to use each one to help the other. 
Prior work has shown that\ncross-task interaction helps, but only explored multitask learning so far. A\ncommon issue with multi-task setup is that argument sequences are still\nseparately decoded, running the risk of generating structurally inconsistent\nlabel sequences (as per lexicons like Semlink). In this paper, we eliminate\nsuch issue with a framework that jointly models VerbNet and PropBank labels as\none sequence. In this setup, we show that enforcing Semlink constraints during\ndecoding constantly improves the overall F1. With special input constructions,\nour joint model infers VerbNet arguments from given PropBank arguments with\nover 99 F1. For learning, we propose a constrained marginal model that learns\nwith knowledge defined in Semlink to further benefit from the large amounts of\nPropBank-only data. On the joint benchmark based on CoNLL05, our models achieve\nstate-of-the-art F1's, outperforming the prior best in-domain model by 3.5\n(VerbNet) and 0.8 (PropBank). For out-of-domain generalization, our models\nsurpass the prior best by 3.4 (VerbNet) and 0.2 (PropBank).\n","authors":["Tao Li","Ghazaleh Kazeminejad","Susan W. Brown","Martha Palmer","Vivek Srikumar"],"pdf_url":"https://arxiv.org/pdf/2305.14600v2.pdf","comment":"Accepted at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08738v1","updated":"2023-10-12T21:51:25Z","published":"2023-10-12T21:51:25Z","title":"Splicing Up Your Predictions with RNA Contrastive Learning","summary":" In the face of rapidly accumulating genomic data, our understanding of the\nRNA regulatory code remains incomplete. Recent self-supervised methods in other\ndomains have demonstrated the ability to learn rules underlying the\ndata-generating process such as sentence structure in language. Inspired by\nthis, we extend contrastive learning techniques to genomic data by utilizing\nfunctional similarities between sequences generated through alternative\nsplicing and gene duplication. Our novel dataset and contrastive objective\nenable the learning of generalized RNA isoform representations. We validate\ntheir utility on downstream tasks such as RNA half-life and mean ribosome load\nprediction. Our pre-training strategy yields competitive results using linear\nprobing on both tasks, along with up to a two-fold increase in Pearson\ncorrelation in low-data conditions. Importantly, our exploration of the learned\nlatent space reveals that our contrastive objective yields semantically\nmeaningful representations, underscoring its potential as a valuable\ninitialization technique for RNA property prediction.\n","authors":["Philip Fradkin","Ruian Shi","Bo Wang","Brendan Frey","Leo J. Lee"],"pdf_url":"https://arxiv.org/pdf/2310.08738v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.00051v2","updated":"2023-10-12T21:47:53Z","published":"2022-12-30T20:38:54Z","title":"Learning from Guided Play: Improving Exploration for Adversarial\n Imitation Learning with Simple Auxiliary Tasks","summary":" Adversarial imitation learning (AIL) has become a popular alternative to\nsupervised imitation learning that reduces the distribution shift suffered by\nthe latter. However, AIL requires effective exploration during an online\nreinforcement learning phase. In this work, we show that the standard, naive\napproach to exploration can manifest as a suboptimal local maximum if a policy\nlearned with AIL sufficiently matches the expert distribution without fully\nlearning the desired task. 
This can be particularly catastrophic for\nmanipulation tasks, where the difference between an expert and a non-expert\nstate-action pair is often subtle. We present Learning from Guided Play (LfGP),\na framework in which we leverage expert demonstrations of multiple exploratory,\nauxiliary tasks in addition to a main task. The addition of these auxiliary\ntasks forces the agent to explore states and actions that standard AIL may\nlearn to ignore. Additionally, this particular formulation allows for the\nreusability of expert data between main tasks. Our experimental results in a\nchallenging multitask robotic manipulation domain indicate that LfGP\nsignificantly outperforms both AIL and behaviour cloning, while also being more\nexpert sample efficient than these baselines. To explain this performance gap,\nwe provide further analysis of a toy problem that highlights the coupling\nbetween a local maximum and poor exploration, and also visualize the\ndifferences between the learned models from AIL and LfGP.\n","authors":["Trevor Ablett","Bryan Chan","Jonathan Kelly"],"pdf_url":"https://arxiv.org/pdf/2301.00051v2.pdf","comment":"In IEEE Robotics and Automation Letters (RA-L) and presented at the\n IEEE/RSJ International Conference on Intelligent Robots and Systems\n (IROS'23), Detroit, MI, USA, Oct. 1-5, 2023. arXiv admin note: substantial\n text overlap with arXiv:2112.08932"},{"id":"http://arxiv.org/abs/2206.14697v3","updated":"2023-10-12T21:47:28Z","published":"2022-06-29T14:54:49Z","title":"Hidden Parameter Recurrent State Space Models For Changing Dynamics\n Scenarios","summary":" Recurrent State-space models (RSSMs) are highly expressive models for\nlearning patterns in time series data and system identification. However, these\nmodels assume that the dynamics are fixed and unchanging, which is rarely the\ncase in real-world scenarios. Many control applications often exhibit tasks\nwith similar but not identical dynamics which can be modeled as a latent\nvariable. We introduce the Hidden Parameter Recurrent State Space Models\n(HiP-RSSMs), a framework that parametrizes a family of related dynamical\nsystems with a low-dimensional set of latent factors. We present a simple and\neffective way of learning and performing inference over this Gaussian graphical\nmodel that avoids approximations like variational inference. We show that\nHiP-RSSMs outperforms RSSMs and competing multi-task models on several\nchallenging robotic benchmarks both on real-world systems and simulations.\n","authors":["Vaisakh Shaj","Dieter Buchler","Rohit Sonker","Philipp Becker","Gerhard Neumann"],"pdf_url":"https://arxiv.org/pdf/2206.14697v3.pdf","comment":"Published at the International Conference on Learning\n Representations, ICLR 2022"},{"id":"http://arxiv.org/abs/2305.16130v2","updated":"2023-10-12T21:43:18Z","published":"2023-05-25T15:04:01Z","title":"A Mechanism for Solving Relational Tasks in Transformer Language Models","summary":" A primary criticism towards language models (LMs) is their inscrutability.\nThis paper presents evidence that, despite their size and complexity, LMs\nsometimes exploit a simple computational mechanism to solve one-to-one\nrelational tasks (e.g., capital_of(Poland)=Warsaw). 
We investigate a range of\nlanguage model sizes (from 124M parameters to 176B parameters) in an in-context\nlearning setting, and find that for a variety of tasks (involving capital\ncities, upper-casing, and past-tensing) a key part of the mechanism reduces to\na simple linear update typically applied by the feedforward (FFN) networks.\nThese updates also tend to promote the output of the relation in a\ncontent-independent way (e.g., encoding Poland:Warsaw::China:Beijing),\nrevealing a predictable pattern that these models take in solving these tasks.\nWe further show that this mechanism is specific to tasks that require retrieval\nfrom pretraining memory, rather than retrieval from local context. Our results\ncontribute to a growing body of work on the mechanistic interpretability of\nLLMs, and offer reason to be optimistic that, despite the massive and\nnon-linear nature of the models, the strategies they ultimately use to solve\ntasks can sometimes reduce to familiar and even intuitive algorithms.\n","authors":["Jack Merullo","Carsten Eickhoff","Ellie Pavlick"],"pdf_url":"https://arxiv.org/pdf/2305.16130v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08732v1","updated":"2023-10-12T21:39:16Z","published":"2023-10-12T21:39:16Z","title":"Provably Robust Cost-Sensitive Learning via Randomized Smoothing","summary":" We focus on learning adversarially robust classifiers under a cost-sensitive\nscenario, where the potential harm of different classwise adversarial\ntransformations is encoded in a binary cost matrix. Existing methods are either\nempirical that cannot certify robustness or suffer from inherent scalability\nissues. In this work, we study whether randomized smoothing, a more scalable\nrobustness certification framework, can be leveraged to certify cost-sensitive\nrobustness. Built upon a notion of cost-sensitive certified radius, we show how\nto adapt the standard randomized smoothing certification pipeline to produce\ntight robustness guarantees for any cost matrix. In addition, with fine-grained\ncertified radius optimization schemes specifically designed for different data\nsubgroups, we propose an algorithm to train smoothed classifiers that are\noptimized for cost-sensitive robustness. Extensive experiments on image\nbenchmarks and a real-world medical dataset demonstrate the superiority of our\nmethod in achieving significantly improved performance of certified\ncost-sensitive robustness while having a negligible impact on overall accuracy.\n","authors":["Yuan Xin","Michael Backes","Xiao Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08732v1.pdf","comment":"18 pages, 7 tables, 4 figures"},{"id":"http://arxiv.org/abs/2310.03320v2","updated":"2023-10-12T21:38:27Z","published":"2023-10-05T05:30:42Z","title":"BioBridge: Bridging Biomedical Foundation Models via Knowledge Graph","summary":" Foundation models (FMs) are able to leverage large volumes of unlabeled data\nto demonstrate superior performance across a wide range of tasks. However, FMs\ndeveloped for biomedical domains have largely remained unimodal, i.e.,\nindependently trained and used for tasks on protein sequences alone, small\nmolecule structures alone, or clinical data alone. To overcome this limitation\nof biomedical FMs, we present BioBridge, a novel parameter-efficient learning\nframework, to bridge independently trained unimodal FMs to establish multimodal\nbehavior. 
BioBridge achieves it by utilizing Knowledge Graphs (KG) to learn\ntransformations between one unimodal FM and another without fine-tuning any\nunderlying unimodal FMs. Our empirical results demonstrate that BioBridge can\nbeat the best baseline KG embedding methods (on average by around 76.3%) in\ncross-modal retrieval tasks. We also identify BioBridge demonstrates\nout-of-domain generalization ability by extrapolating to unseen modalities or\nrelations. Additionally, we also show that BioBridge presents itself as a\ngeneral purpose retriever that can aid biomedical multimodal question answering\nas well as enhance the guided generation of novel drugs.\n","authors":["Zifeng Wang","Zichen Wang","Balasubramaniam Srinivasan","Vassilis N. Ioannidis","Huzefa Rangwala","Rishita Anubhai"],"pdf_url":"https://arxiv.org/pdf/2310.03320v2.pdf","comment":"this paper needs further internal review for being published"},{"id":"http://arxiv.org/abs/2310.08731v1","updated":"2023-10-12T21:38:07Z","published":"2023-10-12T21:38:07Z","title":"A Simple Way to Incorporate Novelty Detection in World Models","summary":" Reinforcement learning (RL) using world models has found significant recent\nsuccesses. However, when a sudden change to world mechanics or properties\noccurs then agent performance and reliability can dramatically decline. We\nrefer to the sudden change in visual properties or state transitions as {\\em\nnovelties}. Implementing novelty detection within generated world model\nframeworks is a crucial task for protecting the agent when deployed. In this\npaper, we propose straightforward bounding approaches to incorporate novelty\ndetection into world model RL agents, by utilizing the misalignment of the\nworld model's hallucinated states and the true observed states as an anomaly\nscore. We first provide an ontology of novelty detection relevant to sequential\ndecision making, then we provide effective approaches to detecting novelties in\na distribution of transitions learned by an agent in a world model. Finally, we\nshow the advantage of our work in a novel environment compared to traditional\nmachine learning novelty detection methods as well as currently accepted RL\nfocused novelty detection algorithms.\n","authors":["Geigh Zollicoffer","Kenneth Eaton","Jonathan Balloch","Julia Kim","Mark O. Riedl","Robert Wright"],"pdf_url":"https://arxiv.org/pdf/2310.08731v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2112.09726v2","updated":"2023-10-12T17:57:51Z","published":"2021-12-17T19:22:01Z","title":"Soundify: Matching Sound Effects to Video","summary":" In the art of video editing, sound helps add character to an object and\nimmerse the viewer within a space. Through formative interviews with\nprofessional editors (N=10), we found that the task of adding sounds to video\ncan be challenging. This paper presents Soundify, a system that assists editors\nin matching sounds to video. Given a video, Soundify identifies matching\nsounds, synchronizes the sounds to the video, and dynamically adjusts panning\nand volume to create spatial audio. In a human evaluation study (N=889), we\nshow that Soundify is capable of matching sounds to video out-of-the-box for a\ndiverse range of audio categories. 
In a within-subjects expert study (N=12), we\ndemonstrate the usefulness of Soundify in helping video editors match sounds to\nvideo with lighter workload, reduced task completion time, and improved\nusability.\n","authors":["David Chuan-En Lin","Anastasis Germanidis","Cristóbal Valenzuela","Yining Shi","Nikolas Martelaro"],"pdf_url":"https://arxiv.org/pdf/2112.09726v2.pdf","comment":"Full paper in UIST 2023; Short paper in NeurIPS 2021 ML4CD Workshop;\n Online demo: https://soundify.cc"},{"id":"http://arxiv.org/abs/2310.08475v1","updated":"2023-10-12T16:32:44Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights\\footnote{Code\nand dataset are available in https://github.com/zjunlp/EasyEdit}.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08205v1","updated":"2023-10-12T10:51:17Z","published":"2023-10-12T10:51:17Z","title":"LiveVV: Human-Centered Live Volumetric Video Streaming System","summary":" Volumetric video has emerged as a prominent medium within the realm of\neXtended Reality (XR) with the advancements in computer graphics and depth\ncapture hardware. Users can fully immerse themselves in volumetric video with\nthe ability to switch their viewport with six degrees of freedom (DOF), including\nthree rotational dimensions (yaw, pitch, roll) and three translational\ndimensions (X, Y, Z). Different from traditional 2D videos that are composed of\npixel matrices, volumetric videos employ point clouds, meshes, or voxels to\nrepresent a volumetric scene, resulting in significantly larger data sizes.\nWhile previous works have successfully achieved volumetric video streaming in\nvideo-on-demand scenarios, the live streaming of volumetric video remains an\nunresolved challenge due to the limited network bandwidth and stringent latency\nconstraints. In this paper, we propose the first holistic live\nvolumetric video streaming system, LiveVV, which achieves multi-view capture,\nscene segmentation \\& reuse, adaptive transmission, and rendering. LiveVV\ncontains multiple lightweight volumetric video capture modules that are capable\nof being deployed without prior preparation. 
To reduce bandwidth consumption,\nLiveVV processes static and dynamic volumetric content separately by reusing\nstatic data with low disparity and decimating data with low visual saliency.\nIn addition, to deal with network fluctuations, LiveVV integrates a volumetric video\nadaptive bitrate streaming algorithm (VABR) to enable fluent playback with the\nmaximum quality of experience. Extensive real-world experiments show that\nLiveVV can achieve live volumetric video streaming at a frame rate of 24 fps\nwith a latency of less than 350ms.\n","authors":["Kaiyuan Hu","Yongting Chen","Kaiying Han","Junhua Liu","Haowen Yang","Yili Jin","Boyan Li","Fangxin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08205v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08006v1","updated":"2023-10-12T03:19:13Z","published":"2023-10-12T03:19:13Z","title":"MCPNS: A Macropixel Collocated Position and Its Neighbors Search for\n Plenoptic 2.0 Video Coding","summary":" Recently, it was demonstrated that a newly focused plenoptic 2.0 camera can\ncapture much higher spatial resolution owing to its effective light field\nsampling, as compared to a traditional unfocused plenoptic 1.0 camera. However,\ndue to the difference in the nature of the optical structure between the plenoptic 1.0\nand 2.0 cameras, the existing fast motion estimation (ME) method for plenoptic\n1.0 videos is expected to be sub-optimal for encoding plenoptic 2.0 videos. In\nthis paper, we point out the main motion characteristic differences between\nplenoptic 1.0 and 2.0 videos and then propose a new fast ME, called macropixel\ncollocated position and its neighbors search (MCPNS), for plenoptic 2.0 videos.\nIn detail, we propose to reduce the number of macropixel collocated position\n(MCP) search candidates based on the new observation of center-biased motion\nvector distribution at macropixel resolution. After that, due to the large motion\ndeviation behavior around each MCP location in plenoptic 2.0 videos, we propose\nto select a certain number of key MCP locations with the lowest matching cost\nto perform the neighbors MCP search to improve the motion search accuracy.\nDifferent from existing methods, our method can achieve better performance\nwithout requiring prior knowledge of microlens array orientations. Our\nsimulation results confirmed the effectiveness of the proposed algorithm in\nterms of both bitrate savings and computational costs compared to existing\nmethods.\n","authors":["Vinh Van Duong","Thuc Nguyen Huu","Jonghoon Yim","Byeungwoo Jeon"],"pdf_url":"https://arxiv.org/pdf/2310.08006v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.06366v4","updated":"2023-10-12T22:25:43Z","published":"2022-10-12T16:18:25Z","title":"A Generalist Framework for Panoptic Segmentation of Images and Videos","summary":" Panoptic segmentation assigns semantic and instance ID labels to every pixel\nof an image. As permutations of instance IDs are also valid solutions, the task\nrequires learning of a high-dimensional one-to-many mapping. As a result,\nstate-of-the-art approaches use customized architectures and task-specific loss\nfunctions. We formulate panoptic segmentation as a discrete data generation\nproblem, without relying on the inductive bias of the task. A diffusion model is\nproposed to model panoptic masks, with a simple architecture and generic loss\nfunction. By simply adding past predictions as a conditioning signal, our\nmethod is capable of modeling video (in a streaming setting) and thereby learns\nto track object instances automatically. 
With extensive experiments, we\ndemonstrate that our simple approach can perform competitively to\nstate-of-the-art specialist methods in similar settings.\n","authors":["Ting Chen","Lala Li","Saurabh Saxena","Geoffrey Hinton","David J. Fleet"],"pdf_url":"https://arxiv.org/pdf/2210.06366v4.pdf","comment":"ICCV'23. Code at https://github.com/google-research/pix2seq"},{"id":"http://arxiv.org/abs/2309.08730v2","updated":"2023-10-12T21:28:02Z","published":"2023-09-15T19:31:40Z","title":"MusiLingo: Bridging Music and Text with Pre-trained Language Models for\n Music Captioning and Query Response","summary":" Large Language Models (LLMs) have shown immense potential in multimodal\napplications, yet the convergence of textual and musical domains remains\nrelatively unexplored. To address this gap, we present MusiLingo, a novel\nsystem for music caption generation and music-related query responses.\nMusiLingo employs a single projection layer to align music representations from\nthe pre-trained frozen music audio model MERT with the frozen Vicuna-7B\nlanguage model (an adaption of LLaMA), bridging the gap between music audio and\ntextual contexts. We train it on an extensive music caption dataset and\nfine-tune it with instructional data. Due to the scarcity of high-quality music\nQ\\&A datasets, we created the Music Instruct (MI) dataset from captions in the\nMusicCaps datasets, tailored for open-ended music inquiries. Empirical\nevaluations demonstrate its competitive performance in generating music\ncaptions and composing music-related Q&A pairs.\n","authors":["Zihao Deng","Yinghao Ma","Yudong Liu","Rongchen Guo","Ge Zhang","Wenhu Chen","Wenhao Huang","Emmanouil Benetos"],"pdf_url":"https://arxiv.org/pdf/2309.08730v2.pdf","comment":null}]},"2023-10-13T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.09266v1","updated":"2023-10-13T17:24:52Z","published":"2023-10-13T17:24:52Z","title":"User Inference Attacks on Large Language Models","summary":" Fine-tuning is a common and effective method for tailoring large language\nmodels (LLMs) to specialized tasks and applications. In this paper, we study\nthe privacy implications of fine-tuning LLMs on user data. To this end, we\ndefine a realistic threat model, called user inference, wherein an attacker\ninfers whether or not a user's data was used for fine-tuning. We implement\nattacks for this threat model that require only a small set of samples from a\nuser (possibly different from the samples used for training) and black-box\naccess to the fine-tuned LLM. We find that LLMs are susceptible to user\ninference attacks across a variety of fine-tuning datasets, at times with near\nperfect attack success rates. Further, we investigate which properties make\nusers vulnerable to user inference, finding that outlier users (i.e. those with\ndata distributions sufficiently different from other users) and users who\ncontribute large quantities of data are most susceptible to attack. Finally, we\nexplore several heuristics for mitigating privacy attacks. We find that\ninterventions in the training algorithm, such as batch or per-example gradient\nclipping and early stopping fail to prevent user inference. However, limiting\nthe number of fine-tuning samples from a single user can reduce attack\neffectiveness, albeit at the cost of reducing the total amount of fine-tuning\ndata.\n","authors":["Nikhil Kandpal","Krishna Pillutla","Alina Oprea","Peter Kairouz","Christopher A. 
Choquette-Choo","Zheng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.09266v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09265v1","updated":"2023-10-13T17:23:17Z","published":"2023-10-13T17:23:17Z","title":"PromptRE: Weakly-Supervised Document-Level Relation Extraction via\n Prompting-Based Data Programming","summary":" Relation extraction aims to classify the relationships between two entities\ninto pre-defined categories. While previous research has mainly focused on\nsentence-level relation extraction, recent studies have expanded the scope to\ndocument-level relation extraction. Traditional relation extraction methods\nheavily rely on human-annotated training data, which is time-consuming and\nlabor-intensive. To mitigate the need for manual annotation, recent\nweakly-supervised approaches have been developed for sentence-level relation\nextraction, while limited work has been done on document-level relation\nextraction. Weakly-supervised document-level relation extraction faces\nsignificant challenges due to an imbalanced number of \"no relation\" instances and\nthe failure of directly probing pretrained large language models for document\nrelation extraction. To address these challenges, we propose PromptRE, a novel\nweakly-supervised document-level relation extraction method that combines\nprompting-based techniques with data programming. Furthermore, PromptRE\nincorporates the label distribution and entity types as prior knowledge to\nimprove the performance. By leveraging the strengths of both prompting and data\nprogramming, PromptRE achieves improved performance in relation classification\nand effectively handles the \"no relation\" problem. Experimental results on\nReDocRED, a benchmark dataset for document-level relation extraction,\ndemonstrate the superiority of PromptRE over baseline approaches.\n","authors":["Chufan Gao","Xulin Fan","Jimeng Sun","Xuan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.09265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.00807v3","updated":"2023-10-13T17:23:04Z","published":"2023-03-01T20:21:23Z","title":"UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and\n Distillation of Rerankers","summary":" Many information retrieval tasks require large labeled datasets for\nfine-tuning. However, such datasets are often unavailable, and their utility\nfor real-world applications can diminish quickly due to domain shifts. To\naddress this challenge, we develop and motivate a method for using large\nlanguage models (LLMs) to generate large numbers of synthetic queries cheaply.\nThe method begins by generating a small number of synthetic queries using an\nexpensive LLM. After that, a much less expensive one is used to create large\nnumbers of synthetic queries, which are used to fine-tune a family of reranker\nmodels. These rerankers are then distilled into a single efficient retriever\nfor use in the target domain. 
We show that this technique boosts zero-shot\naccuracy in long-tail domains and achieves substantially lower latency than\nstandard reranking methods.\n","authors":["Jon Saad-Falcon","Omar Khattab","Keshav Santhanam","Radu Florian","Martin Franz","Salim Roukos","Avirup Sil","Md Arafat Sultan","Christopher Potts"],"pdf_url":"https://arxiv.org/pdf/2303.00807v3.pdf","comment":"Long Paper at Empirical Methods in Natural Language Processing\n (EMNLP) 2023"},{"id":"http://arxiv.org/abs/2310.09263v1","updated":"2023-10-13T17:20:56Z","published":"2023-10-13T17:20:56Z","title":"Table-GPT: Table-tuned GPT for Diverse Table Tasks","summary":" Language models, such as GPT-3.5 and ChatGPT, demonstrate remarkable\nabilities to follow diverse human instructions and perform a wide range of\ntasks. However, when probing language models using a range of basic\ntable-understanding tasks, we observe that today's language models are still\nsub-optimal in many table-related tasks, likely because they are pre-trained\npredominantly on \\emph{one-dimensional} natural-language texts, whereas\nrelational tables are \\emph{two-dimensional} objects.\n In this work, we propose a new \"\\emph{table-tuning}\" paradigm, where we\ncontinue to train/fine-tune language models like GPT-3.5 and ChatGPT, using\ndiverse table-tasks synthesized from real tables as training data, with the\ngoal of enhancing language models' ability to understand tables and perform\ntable tasks. We show that our resulting Table-GPT models demonstrate (1) better\n\\emph{table-understanding} capabilities, by consistently outperforming the\nvanilla GPT-3.5 and ChatGPT, on a wide-range of table tasks, including holdout\nunseen tasks, and (2) strong \\emph{generalizability}, in its ability to respond\nto diverse human instructions to perform new table-tasks, in a manner similar\nto GPT-3.5 and ChatGPT.\n","authors":["Peng Li","Yeye He","Dror Yashar","Weiwei Cui","Song Ge","Haidong Zhang","Danielle Rifinski Fainman","Dongmei Zhang","Surajit Chaudhuri"],"pdf_url":"https://arxiv.org/pdf/2310.09263v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09256v1","updated":"2023-10-13T17:13:00Z","published":"2023-10-13T17:13:00Z","title":"Political claim identification and categorization in a multilingual\n setting: First experiments","summary":" The identification and classification of political claims is an important\nstep in the analysis of political newspaper reports; however, resources for\nthis task are few and far between. This paper explores different strategies for\nthe cross-lingual projection of political claims analysis. We conduct\nexperiments on a German dataset, DebateNet2.0, covering the policy debate\nsparked by the 2015 refugee crisis. Our evaluation involves two tasks (claim\nidentification and categorization), three languages (German, English, and\nFrench) and two methods (machine translation -- the best method in our\nexperiments -- and multilingual embeddings).\n","authors":["Urs Zaberer","Sebastian Padó","Gabriella Lapesa"],"pdf_url":"https://arxiv.org/pdf/2310.09256v1.pdf","comment":"Presented at KONVENS 2023, Ingolstadt, Germany"},{"id":"http://arxiv.org/abs/2307.02599v2","updated":"2023-10-13T17:01:11Z","published":"2023-07-05T18:48:28Z","title":"Evade ChatGPT Detectors via A Single Space","summary":" ChatGPT brings revolutionary social value but also raises concerns about the\nmisuse of AI-generated text. Consequently, an important question is how to\ndetect whether texts are generated by ChatGPT or by human. 
Existing detectors\nare built upon the assumption that there are distributional gaps between\nhuman-generated and AI-generated text. These gaps are typically identified\nusing statistical information or classifiers. Our research challenges the\ndistributional gap assumption in detectors. We find that detectors do not\neffectively discriminate the semantic and stylistic gaps between\nhuman-generated and AI-generated text. Instead, the \"subtle differences\", such\nas an extra space, become crucial for detection. Based on this discovery, we\npropose the SpaceInfi strategy to evade detection. Experiments demonstrate the\neffectiveness of this strategy across multiple benchmarks and detectors. We\nalso provide a theoretical explanation for why SpaceInfi is successful in\nevading perplexity-based detection. And we empirically show that a phenomenon\ncalled token mutation causes the evasion for language model-based detectors.\nOur findings offer new insights and challenges for understanding and\nconstructing more applicable ChatGPT detectors.\n","authors":["Shuyang Cai","Wanyun Cui"],"pdf_url":"https://arxiv.org/pdf/2307.02599v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09247v1","updated":"2023-10-13T16:53:25Z","published":"2023-10-13T16:53:25Z","title":"Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet\n Hierarchy","summary":" Text-to-image synthesis has recently attracted widespread attention due to\nrapidly improving quality and numerous practical applications. However, the\nlanguage understanding capabilities of text-to-image models are still poorly\nunderstood, which makes it difficult to reason about prompt formulations that a\ngiven model would understand well. In this work, we measure the capability of\npopular text-to-image models to understand $\\textit{hypernymy}$, or the \"is-a\"\nrelation between words. We design two automatic metrics based on the WordNet\nsemantic hierarchy and existing image classifiers pretrained on ImageNet. These\nmetrics both enable broad quantitative comparison of linguistic capabilities\nfor text-to-image models and offer a way of finding fine-grained qualitative\ndifferences, such as words that are unknown to models and thus are difficult\nfor them to draw. We comprehensively evaluate popular text-to-image models,\nincluding GLIDE, Latent Diffusion, and Stable Diffusion, showing how our\nmetrics can provide a better understanding of the individual strengths and\nweaknesses of these models.\n","authors":["Anton Baryshnikov","Max Ryabinin"],"pdf_url":"https://arxiv.org/pdf/2310.09247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09241v1","updated":"2023-10-13T16:47:20Z","published":"2023-10-13T16:47:20Z","title":"Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model\n Collaboration","summary":" Legal Judgment Prediction (LJP) has become an increasingly crucial task in\nLegal AI, i.e., predicting the judgment of the case in terms of case fact\ndescription. Precedents are the previous legal cases with similar facts, which\nare the basis for the judgment of the subsequent case in national legal\nsystems. Thus, it is worthwhile to explore the utilization of precedents in the\nLJP. Recent advances in deep learning have enabled a variety of techniques to\nbe used to solve the LJP task. These can be broken down into two categories:\nlarge language models (LLMs) and domain-specific models. 
LLMs are capable of\ninterpreting and generating complex natural language, while domain models are\nefficient in learning task-specific information. In this paper, we propose the\nprecedent-enhanced LJP framework (PLJP), a system that leverages the strength\nof both LLM and domain models in the context of precedents. Specifically, the\ndomain models are designed to provide candidate labels and find the proper\nprecedents efficiently, and the large models will make the final prediction\nwith an in-context precedents comprehension. Experiments on the real-world\ndataset demonstrate the effectiveness of our PLJP. Moreover, our work shows a\npromising direction for LLM and domain-model collaboration that can be\ngeneralized to other vertical domains.\n","authors":["Yiquan Wu","Siying Zhou","Yifei Liu","Weiming Lu","Xiaozhong Liu","Yating Zhang","Changlong Sun","Fei Wu","Kun Kuang"],"pdf_url":"https://arxiv.org/pdf/2310.09241v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09238v1","updated":"2023-10-13T16:46:38Z","published":"2023-10-13T16:46:38Z","title":"BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models\n for Sentiment Analysis of Bangla Social Media Posts","summary":" Bangla is the 7th most widely spoken language globally, with a staggering 234\nmillion native speakers primarily hailing from India and Bangladesh. This\nmorphologically rich language boasts a rich literary tradition, encompassing\ndiverse dialects and language-specific challenges. Despite its linguistic\nrichness and history, Bangla remains categorized as a low-resource language\nwithin the natural language processing (NLP) and speech community. This paper\npresents our submission to Task 2 (Sentiment Analysis of Bangla Social Media\nPosts) of the BLP Workshop. We experiment with various Transformer-based\narchitectures to solve this task. Our quantitative results show that transfer\nlearning really helps in better learning of the models in this low-resource\nlanguage scenario. This becomes evident when we further finetune a model which\nhas already been finetuned on twitter data for sentiment analysis task and that\nfinetuned model performs the best among all other models. We also perform a\ndetailed error analysis where we find some instances where ground truth labels\nneed to be relooked at. We obtain a micro-F1 of 67.02\\% on the test set and our\nperformance in this shared task is ranked at 21 in the leaderboard.\n","authors":["Saumajit Saha","Albert Nanda"],"pdf_url":"https://arxiv.org/pdf/2310.09238v1.pdf","comment":"7 pages, 2 figures, workshop"},{"id":"http://arxiv.org/abs/2310.09233v1","updated":"2023-10-13T16:37:14Z","published":"2023-10-13T16:37:14Z","title":"AgentCF: Collaborative Learning with Autonomous Language Agents for\n Recommender Systems","summary":" Recently, there has been an emergence of employing LLM-powered agents as\nbelievable human proxies, based on their remarkable decision-making capability.\nHowever, existing studies mainly focus on simulating human dialogue. Human\nnon-verbal behaviors, such as item clicking in recommender systems, although\nimplicitly exhibiting user preferences and could enhance the modeling of users,\nhave not been deeply explored. The main reasons lie in the gap between language\nmodeling and behavior modeling, as well as the incomprehension of LLMs about\nuser-item relations.\n To address this issue, we propose AgentCF for simulating user-item\ninteractions in recommender systems through agent-based collaborative\nfiltering. 
We creatively consider not only users but also items as agents, and\ndevelop a collaborative learning approach that optimizes both kinds of agents\ntogether. Specifically, at each time step, we first prompt the user and item\nagents to interact autonomously. Then, based on the disparities between the\nagents' decisions and real-world interaction records, user and item agents are\nprompted to reflect on and adjust the misleading simulations collaboratively,\nthereby modeling their two-sided relations. The optimized agents can also\npropagate their preferences to other agents in subsequent interactions,\nimplicitly capturing the collaborative filtering idea. Overall, the optimized\nagents exhibit diverse interaction behaviors within our framework, including\nuser-item, user-user, item-item, and collective interactions. The results show\nthat these agents can demonstrate personalized behaviors akin to those of\nreal-world individuals, sparking the development of next-generation user\nbehavior simulation.\n","authors":["Junjie Zhang","Yupeng Hou","Ruobing Xie","Wenqi Sun","Julian McAuley","Wayne Xin Zhao","Leyu Lin","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2310.09233v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09223v1","updated":"2023-10-13T16:21:07Z","published":"2023-10-13T16:21:07Z","title":"Automated Claim Matching with Large Language Models: Empowering\n Fact-Checkers in the Fight Against Misinformation","summary":" In today's digital era, the rapid spread of misinformation poses threats to\npublic well-being and societal trust. As online misinformation proliferates,\nmanual verification by fact checkers becomes increasingly challenging. We\nintroduce FACT-GPT (Fact-checking Augmentation with Claim matching\nTask-oriented Generative Pre-trained Transformer), a framework designed to\nautomate the claim matching phase of fact-checking using Large Language Models\n(LLMs). This framework identifies new social media content that either supports\nor contradicts claims previously debunked by fact-checkers. Our approach\nemploys GPT-4 to generate a labeled dataset consisting of simulated social\nmedia posts. This data set serves as a training ground for fine-tuning more\nspecialized LLMs. We evaluated FACT-GPT on an extensive dataset of social media\ncontent related to public health. The results indicate that our fine-tuned LLMs\nrival the performance of larger pre-trained LLMs in claim matching tasks,\naligning closely with human annotations. This study achieves three key\nmilestones: it provides an automated framework for enhanced fact-checking;\ndemonstrates the potential of LLMs to complement human expertise; offers public\nresources, including datasets and models, to further research and applications\nin the fact-checking domain.\n","authors":["Eun Cheol Choi","Emilio Ferrara"],"pdf_url":"https://arxiv.org/pdf/2310.09223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03026v2","updated":"2023-10-13T16:13:43Z","published":"2023-10-04T17:59:49Z","title":"LanguageMPC: Large Language Models as Decision Makers for Autonomous\n Driving","summary":" Existing learning-based autonomous driving (AD) systems face challenges in\ncomprehending high-level information, generalizing to rare events, and\nproviding interpretability. To address these problems, this work employs Large\nLanguage Models (LLMs) as a decision-making component for complex AD scenarios\nthat require human commonsense understanding. 
We devise cognitive pathways to\nenable comprehensive reasoning with LLMs, and develop algorithms for\ntranslating LLM decisions into actionable driving commands. Through this\napproach, LLM decisions are seamlessly integrated with low-level controllers by\nguided parameter matrix adaptation. Extensive experiments demonstrate that our\nproposed method not only consistently surpasses baseline approaches in\nsingle-vehicle tasks, but also helps handle complex driving behaviors, even\nmulti-vehicle coordination, thanks to the commonsense reasoning capabilities of\nLLMs. This paper presents an initial step toward leveraging LLMs as effective\ndecision-makers for intricate AD scenarios in terms of safety, efficiency,\ngeneralizability, and interoperability. We aspire for it to serve as\ninspiration for future research in this field. Project page:\nhttps://sites.google.com/view/llm-mpc\n","authors":["Hao Sha","Yao Mu","Yuxuan Jiang","Li Chen","Chenfeng Xu","Ping Luo","Shengbo Eben Li","Masayoshi Tomizuka","Wei Zhan","Mingyu Ding"],"pdf_url":"https://arxiv.org/pdf/2310.03026v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09219v1","updated":"2023-10-13T16:12:57Z","published":"2023-10-13T16:12:57Z","title":"\"Kelly is a Warm Person, Joseph is a Role Model\": Gender Biases in\n LLM-Generated Reference Letters","summary":" As generative language models advance, users have started to utilize Large\nLanguage Models (LLMs) to assist in writing various types of content, including\nprofessional documents such as recommendation letters. Despite their\nconvenience, these applications introduce unprecedented fairness concerns. As\ngenerated reference letters might be directly utilized by users in professional\nor academic scenarios, they have the potential to cause direct social harms,\nsuch as lowering success rates for female applicants. Therefore, it is urgent\nand necessary to comprehensively study fairness issues and associated harms in\nsuch real-world use cases for future mitigation and monitoring. In this paper,\nwe critically examine gender bias in LLM-generated reference letters. Inspired\nby findings in social science, we design evaluation methods to manifest gender\nbiases in LLM-generated letters through 2 dimensions: biases in language style\nand biases in lexical content. Furthermore, we investigate the extent of bias\npropagation by separately analyzing bias amplification in model-hallucinated\ncontents, which we define to be the hallucination bias of model-generated\ndocuments. Through benchmarking evaluation on 4 popular LLMs, including\nChatGPT, Alpaca, Vicuna and StableLM, our study reveals significant gender\nbiases in LLM-generated recommendation letters. Our findings further point\ntowards the importance and urgency of recognizing biases in LLM-generated\nprofessional documents.\n","authors":["Yixin Wan","George Pu","Jiao Sun","Aparna Garimella","Kai-Wei Chang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2310.09219v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.01248v3","updated":"2023-10-13T15:53:00Z","published":"2023-03-01T06:16:14Z","title":"Can ChatGPT Assess Human Personalities? A General Evaluation Framework","summary":" Large Language Models (LLMs), especially ChatGPT, have produced impressive\nresults in various areas, but their potential human-like psychology is still\nlargely unexplored. Existing works study the virtual personalities of LLMs but\nrarely explore the possibility of analyzing human personalities via LLMs. 
This\npaper presents a generic evaluation framework for LLMs to assess human\npersonalities based on Myers Briggs Type Indicator (MBTI) tests. Specifically,\nwe first devise unbiased prompts by randomly permuting options in MBTI\nquestions and adopt the average testing result to encourage more impartial\nanswer generation. Then, we propose to replace the subject in question\nstatements to enable flexible queries and assessments on different subjects\nfrom LLMs. Finally, we re-formulate the question instructions in a manner of\ncorrectness evaluation to facilitate LLMs to generate clearer responses. The\nproposed framework enables LLMs to flexibly assess personalities of different\ngroups of people. We further propose three evaluation metrics to measure the\nconsistency, robustness, and fairness of assessment results from\nstate-of-the-art LLMs including ChatGPT and GPT-4. Our experiments reveal\nChatGPT's ability to assess human personalities, and the average results\ndemonstrate that it can achieve more consistent and fairer assessments in spite\nof lower robustness against prompt biases compared with InstructGPT.\n","authors":["Haocong Rao","Cyril Leung","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2303.01248v3.pdf","comment":"Accepted to EMNLP 2023. Our codes are available at\n https://github.com/Kali-Hac/ChatGPT-MBTI"},{"id":"http://arxiv.org/abs/2308.01497v2","updated":"2023-10-13T15:51:46Z","published":"2023-08-03T01:46:27Z","title":"Large Language Model Displays Emergent Ability to Interpret Novel\n Literary Metaphors","summary":" Recent advances in the performance of large language models (LLMs) have\nsparked debate over whether, given sufficient training, high-level human\nabilities emerge in such generic forms of artificial intelligence (AI). Despite\nthe exceptional performance of LLMs on a wide range of tasks involving natural\nlanguage processing and reasoning, there has been sharp disagreement as to\nwhether their abilities extend to more creative human abilities. A core example\nis the ability to interpret novel metaphors. Given the enormous and non curated\ntext corpora used to train LLMs, a serious obstacle to designing tests is the\nrequirement of finding novel yet high quality metaphors that are unlikely to\nhave been included in the training data. Here we assessed the ability of GPT4,\na state of the art large language model, to provide natural-language\ninterpretations of novel literary metaphors drawn from Serbian poetry and\ntranslated into English. Despite exhibiting no signs of having been exposed to\nthese metaphors previously, the AI system consistently produced detailed and\nincisive interpretations. Human judges, blind to the fact that an AI model was\ninvolved, rated metaphor interpretations generated by GPT4 as superior to those\nprovided by a group of college students. In interpreting reversed metaphors,\nGPT4, as well as humans, exhibited signs of sensitivity to the Gricean\ncooperative principle. In addition, for several novel English poems GPT4\nproduced interpretations that were rated as excellent or good by a human\nliterary critic. These results indicate that LLMs such as GPT4 have acquired an\nemergent ability to interpret complex metaphors, including those embedded in\nnovel poems.\n","authors":["Nicholas Ichien","Dušan Stamenković","Keith J. 
Holyoak"],"pdf_url":"https://arxiv.org/pdf/2308.01497v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.10977v2","updated":"2023-10-13T15:38:46Z","published":"2022-05-23T00:57:33Z","title":"What should I Ask: A Knowledge-driven Approach for Follow-up Questions\n Generation in Conversational Surveys","summary":" Generating follow-up questions on the fly could significantly improve\nconversational survey quality and user experiences by enabling a more dynamic\nand personalized survey structure. In this paper, we proposed a novel task for\nknowledge-driven follow-up question generation in conversational surveys. We\nconstructed a new human-annotated dataset of human-written follow-up questions\nwith dialogue history and labeled knowledge in the context of conversational\nsurveys. Along with the dataset, we designed and validated a set of\nreference-free Gricean-inspired evaluation metrics to systematically evaluate\nthe quality of generated follow-up questions. We then propose a two-staged\nknowledge-driven model for the task, which generates informative and coherent\nfollow-up questions by using knowledge to steer the generation process. The\nexperiments demonstrate that compared to GPT-based baseline models, our\ntwo-staged model generates more informative, coherent, and clear follow-up\nquestions.\n","authors":["Yubin Ge","Ziang Xiao","Jana Diesner","Heng Ji","Karrie Karahalios","Hari Sundaram"],"pdf_url":"https://arxiv.org/pdf/2205.10977v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17216v3","updated":"2023-10-13T15:35:42Z","published":"2023-05-26T19:22:03Z","title":"Generating Images with Multimodal Language Models","summary":" We propose a method to fuse frozen text-only large language models (LLMs)\nwith pre-trained image encoder and decoder models, by mapping between their\nembedding spaces. Our model demonstrates a wide suite of multimodal\ncapabilities: image retrieval, novel image generation, and multimodal dialogue.\nOurs is the first approach capable of conditioning on arbitrarily interleaved\nimage and text inputs to generate coherent image (and text) outputs. To achieve\nstrong performance on image generation, we propose an efficient mapping network\nto ground the LLM to an off-the-shelf text-to-image generation model. This\nmapping network translates hidden representations of text into the embedding\nspace of the visual models, enabling us to leverage the strong text\nrepresentations of the LLM for visual outputs. Our approach outperforms\nbaseline generation models on tasks with longer and more complex language. In\naddition to novel image generation, our model is also capable of image\nretrieval from a prespecified dataset, and decides whether to retrieve or\ngenerate at inference time. This is done with a learnt decision module which\nconditions on the hidden representations of the LLM. Our model exhibits a wider\nrange of capabilities compared to prior multimodal language models. It can\nprocess image-and-text inputs, and produce retrieved images, generated images,\nand generated text -- outperforming non-LLM based generation models across\nseveral text-to-image tasks that measure context dependence.\n","authors":["Jing Yu Koh","Daniel Fried","Ruslan Salakhutdinov"],"pdf_url":"https://arxiv.org/pdf/2305.17216v3.pdf","comment":"NeurIPS 2023. 
Project page: http://jykoh.com/gill"},{"id":"http://arxiv.org/abs/2308.02490v2","updated":"2023-10-13T15:16:59Z","published":"2023-08-04T17:59:47Z","title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","summary":" We propose MM-Vet, an evaluation benchmark that examines large multimodal\nmodels (LMMs) on complicated multimodal tasks. Recent LMMs have shown various\nintriguing abilities, such as solving math problems written on the blackboard,\nreasoning about events and celebrities in news images, and explaining visual\njokes. Rapid model advancements pose challenges to evaluation benchmark\ndevelopment. Problems include: (1) How to systematically structure and evaluate\nthe complicated multimodal tasks; (2) How to design evaluation metrics that\nwork well across question and answer types; and (3) How to give model insights\nbeyond a simple performance ranking. To this end, we present MM-Vet, designed\nbased on the insight that the intriguing ability to solve complicated tasks is\noften achieved by a generalist model being able to integrate different core\nvision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and\nexamines the 16 integrations of interest derived from the capability\ncombination. For evaluation metrics, we propose an LLM-based evaluator for\nopen-ended outputs. The evaluator enables the evaluation across different\nquestion types and answer styles, resulting in a unified scoring metric. We\nevaluate representative LMMs on MM-Vet, providing insights into the\ncapabilities of different LMM system paradigms and models. Code and data are\navailable at https://github.com/yuweihao/MM-Vet.\n","authors":["Weihao Yu","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Kevin Lin","Zicheng Liu","Xinchao Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.02490v2.pdf","comment":"Update results of OpenFlamingo-9B (MPT), LLaMA-Adapter v2-7B, and\n Otter-9B (MPT). Code, data and leaderboard:\n https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2310.05597v2","updated":"2023-10-13T15:07:28Z","published":"2023-10-09T10:34:38Z","title":"Can language models learn analogical reasoning? Investigating training\n objectives and comparisons to human performance","summary":" While analogies are a common way to evaluate word embeddings in NLP, it is\nalso of interest to investigate whether or not analogical reasoning is a task\nin itself that can be learned. In this paper, we test several ways to learn\nbasic analogical reasoning, specifically focusing on analogies that are more\ntypical of what is used to evaluate analogical reasoning in humans than those\nin commonly used NLP benchmarks. Our experiments find that models are able to\nlearn analogical reasoning, even with a small amount of data. We additionally\ncompare our models to a dataset with a human baseline, and find that after\ntraining, models approach human performance.\n","authors":["Molly R. Petersen","Lonneke van der Plas"],"pdf_url":"https://arxiv.org/pdf/2310.05597v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09168v1","updated":"2023-10-13T15:03:15Z","published":"2023-10-13T15:03:15Z","title":"Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through\n Active Exploration","summary":" Instruction-tuning can be substantially optimized through enhanced diversity,\nresulting in models capable of handling a broader spectrum of tasks. 
However,\nexisting data employed for such tuning often exhibit an inadequate coverage of\nindividual domains, limiting the scope for nuanced comprehension and\ninteractions within these areas. To address this deficiency, we propose\nExplore-Instruct, a novel approach to enhance the data coverage to be used in\ndomain-specific instruction-tuning through active exploration via Large\nLanguage Models (LLMs). Built upon representative domain use cases,\nExplore-Instruct explores a multitude of variations or possibilities by\nimplementing a search algorithm to obtain diversified and domain-focused\ninstruction-tuning data. Our data-centric analysis validates the effectiveness\nof this proposed approach in improving domain-specific instruction coverage.\nMoreover, our model's performance demonstrates considerable advancements over\nmultiple baselines, including those utilizing domain-specific data enhancement.\nOur findings offer a promising opportunity to improve instruction coverage,\nespecially in domain-specific contexts, thereby advancing the development of\nadaptable language models. Our code, model weights, and data are public at\n\\url{https://github.com/fanqiwan/Explore-Instruct}.\n","authors":["Fanqi Wan","Xinting Huang","Tao Yang","Xiaojun Quan","Wei Bi","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2310.09168v1.pdf","comment":"Accepted to EMNLP 2023 (Main Conference)"},{"id":"http://arxiv.org/abs/2310.09166v1","updated":"2023-10-13T15:01:17Z","published":"2023-10-13T15:01:17Z","title":"Developing a Natural Language Understanding Model to Characterize Cable\n News Bias","summary":" Media bias has been extensively studied by both social and computational\nsciences. However, current work still has a large reliance on human input and\nsubjective assessment to label biases. This is especially true for cable news\nresearch. To address these issues, we develop an unsupervised machine learning\nmethod to characterize the bias of cable news programs without any human input.\nThis method relies on the analysis of what topics are mentioned through Named\nEntity Recognition and how those topics are discussed through Stance Analysis\nin order to cluster programs with similar biases together. Applying our method\nto 2020 cable news transcripts, we find that program clusters are consistent\nover time and roughly correspond to the cable news network of the program. This\nmethod reveals the potential for future tools to objectively assess media bias\nand characterize unfamiliar media environments.\n","authors":["Seth P. Benson","Iain J. Cruickshank"],"pdf_url":"https://arxiv.org/pdf/2310.09166v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09151v1","updated":"2023-10-13T14:44:34Z","published":"2023-10-13T14:44:34Z","title":"BibRank: Automatic Keyphrase Extraction Platform Using~Metadata","summary":" Automatic Keyphrase Extraction involves identifying essential phrases in a\ndocument. These keyphrases are crucial in various tasks such as document\nclassification, clustering, recommendation, indexing, searching, summarization,\nand text simplification. This paper introduces a platform that integrates\nkeyphrase datasets and facilitates the evaluation of keyphrase extraction\nalgorithms. The platform includes BibRank, an automatic keyphrase extraction\nalgorithm that leverages a rich dataset obtained by parsing bibliographic data\nin BibTeX format. 
BibRank combines innovative weighting techniques with\npositional, statistical, and word co-occurrence information to extract\nkeyphrases from documents. The platform proves valuable for researchers and\ndevelopers seeking to enhance their keyphrase extraction algorithms and advance\nthe field of natural language processing.\n","authors":["Abdelrhman Eldallal","Eduard Barbu"],"pdf_url":"https://arxiv.org/pdf/2310.09151v1.pdf","comment":"12 pages , 4 figures, 8 tables"},{"id":"http://arxiv.org/abs/2310.09141v1","updated":"2023-10-13T14:33:02Z","published":"2023-10-13T14:33:02Z","title":"PuoBERTa: Training and evaluation of a curated language model for\n Setswana","summary":" Natural language processing (NLP) has made significant progress for\nwell-resourced languages such as English but lagged behind for low-resource\nlanguages like Setswana. This paper addresses this gap by presenting PuoBERTa,\na customised masked language model trained specifically for Setswana. We cover\nhow we collected, curated, and prepared diverse monolingual texts to generate a\nhigh-quality corpus for PuoBERTa's training. Building upon previous efforts in\ncreating monolingual resources for Setswana, we evaluated PuoBERTa across\nseveral NLP tasks, including part-of-speech (POS) tagging, named entity\nrecognition (NER), and news categorisation. Additionally, we introduced a new\nSetswana news categorisation dataset and provided the initial benchmarks using\nPuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP\ncapabilities for understudied languages like Setswana and paves the way for\nfuture research directions.\n","authors":["Vukosi Marivate","Moseli Mots'Oehli","Valencia Wagner","Richard Lastrucci","Isheanesu Dzingirai"],"pdf_url":"https://arxiv.org/pdf/2310.09141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.00639v2","updated":"2023-10-13T14:28:03Z","published":"2023-06-01T13:06:43Z","title":"Being Right for Whose Right Reasons?","summary":" Explainability methods are used to benchmark the extent to which model\npredictions align with human rationales i.e., are 'right for the right\nreasons'. Previous work has failed to acknowledge, however, that what counts as\na rationale is sometimes subjective. This paper presents what we think is a\nfirst of its kind, a collection of human rationale annotations augmented with\nthe annotators demographic information. We cover three datasets spanning\nsentiment analysis and common-sense reasoning, and six demographic groups\n(balanced across age and ethnicity). Such data enables us to ask both what\ndemographics our predictions align with and whose reasoning patterns our\nmodels' rationales align with. We find systematic inter-group annotator\ndisagreement and show how 16 Transformer-based models align better with\nrationales provided by certain demographic groups: We find that models are\nbiased towards aligning best with older and/or white annotators. 
We zoom in on\nthe effects of model size and model distillation, finding -- contrary to our\nexpectations -- negative correlations between model size and rationale\nagreement as well as no evidence that either model size or model distillation\nimproves fairness.\n","authors":["Terne Sasha Thorn Jakobsen","Laura Cabello","Anders Søgaard"],"pdf_url":"https://arxiv.org/pdf/2306.00639v2.pdf","comment":"In Proceedings of ACL 2023"},{"id":"http://arxiv.org/abs/2310.09139v1","updated":"2023-10-13T14:27:21Z","published":"2023-10-13T14:27:21Z","title":"The Consensus Game: Language Model Generation via Equilibrium Search","summary":" When applied to question answering and other text generation tasks, language\nmodels (LMs) may be queried generatively (by sampling answers from their output\ndistribution) or discriminatively (by using them to score or rank a set of\ncandidate outputs). These procedures sometimes yield very different\npredictions. How do we reconcile mutually incompatible scoring procedures to\nobtain coherent LM predictions? We introduce a new, training-free,\ngame-theoretic procedure for language model decoding. Our approach casts\nlanguage model decoding as a regularized imperfect-information sequential\nsignaling game - which we term the CONSENSUS GAME - in which a GENERATOR seeks\nto communicate an abstract correctness parameter using natural language\nsentences to a DISCRIMINATOR. We develop computational procedures for finding\napproximate equilibria of this game, resulting in a decoding algorithm we call\nEQUILIBRIUM-RANKING. Applied to a large number of tasks (including reading\ncomprehension, commonsense reasoning, mathematical problem-solving, and\ndialog), EQUILIBRIUM-RANKING consistently, and sometimes substantially,\nimproves performance over existing LM decoding procedures - on multiple\nbenchmarks, we observe that applying EQUILIBRIUM-RANKING to LLaMA-7B\noutperforms the much larger LLaMA-65B and PaLM-540B models. These results\nhighlight the promise of game-theoretic tools for addressing fundamental\nchallenges of truthfulness and consistency in LMs.\n","authors":["Athul Paul Jacob","Yikang Shen","Gabriele Farina","Jacob Andreas"],"pdf_url":"https://arxiv.org/pdf/2310.09139v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07321v2","updated":"2023-10-13T14:24:31Z","published":"2023-10-11T09:09:55Z","title":"On the Impact of Cross-Domain Data on German Language Models","summary":" Traditionally, large language models have been either trained on general web\ncrawls or domain-specific data. However, recent successes of generative large\nlanguage models have shed light on the benefits of cross-domain datasets. To\nexamine the significance of prioritizing data diversity over quality, we\npresent a German dataset comprising texts from five domains, along with another\ndataset aimed at containing high-quality data. Through training a series of\nmodels ranging between 122M and 750M parameters on both datasets, we conduct a\ncomprehensive benchmark on multiple downstream tasks. Our findings demonstrate\nthat the models trained on the cross-domain dataset outperform those trained on\nquality data alone, leading to improvements up to $4.45\\%$ over the previous\nstate-of-the-art. The models are available at\nhttps://huggingface.co/ikim-uk-essen\n","authors":["Amin Dada","Aokun Chen","Cheng Peng","Kaleb E Smith","Ahmad Idrissi-Yaghir","Constantin Marc Seibold","Jianning Li","Lars Heiliger","Xi Yang","Christoph M. 
Friedrich","Daniel Truhn","Jan Egger","Jiang Bian","Jens Kleesiek","Yonghui Wu"],"pdf_url":"https://arxiv.org/pdf/2310.07321v2.pdf","comment":"13 pages, 1 figure, accepted at Findings of the Association for\n Computational Linguistics: EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.09135v1","updated":"2023-10-13T14:23:33Z","published":"2023-10-13T14:23:33Z","title":"HierarchicalContrast: A Coarse-to-Fine Contrastive Learning Framework\n for Cross-Domain Zero-Shot Slot Filling","summary":" In task-oriented dialogue scenarios, cross-domain zero-shot slot filling\nplays a vital role in leveraging source domain knowledge to learn a model with\nhigh generalization ability in unknown target domain where annotated data is\nunavailable. However, the existing state-of-the-art zero-shot slot filling\nmethods have limited generalization ability in target domain, they only show\neffective knowledge transfer on seen slots and perform poorly on unseen slots.\nTo alleviate this issue, we present a novel Hierarchical Contrastive Learning\nFramework (HiCL) for zero-shot slot filling. Specifically, we propose a coarse-\nto fine-grained contrastive learning based on Gaussian-distributed embedding to\nlearn the generalized deep semantic relations between utterance-tokens, by\noptimizing inter- and intra-token distribution distance. This encourages HiCL\nto generalize to the slot types unseen at training phase. Furthermore, we\npresent a new iterative label set semantics inference method to unbiasedly and\nseparately evaluate the performance of unseen slot types which entangled with\ntheir counterparts (i.e., seen slot types) in the previous zero-shot slot\nfilling evaluation methods. The extensive empirical experiments on four\ndatasets demonstrate that the proposed method achieves comparable or even\nbetter performance than the current state-of-the-art zero-shot slot filling\napproaches.\n","authors":["Junwen Zhang","Yin Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.09135v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.09119v1","updated":"2023-10-13T14:03:01Z","published":"2023-10-13T14:03:01Z","title":"A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for\n Chinese Spelling Check","summary":" In recent years, Chinese Spelling Check (CSC) has been greatly improved by\ndesigning task-specific pre-training methods or introducing auxiliary tasks,\nwhich mostly solve this task in an end-to-end fashion. In this paper, we\npropose to decompose the CSC workflow into detection, reasoning, and searching\nsubtasks so that the rich external knowledge about the Chinese language can be\nleveraged more directly and efficiently. Specifically, we design a\nplug-and-play detection-and-reasoning module that is compatible with existing\nSOTA non-autoregressive CSC models to further boost their performance. We find\nthat the detection-and-reasoning module trained for one model can also benefit\nother models. We also study the primary interpretability provided by the task\ndecomposition. 
Extensive experiments and detailed analyses demonstrate the\neffectiveness and competitiveness of the proposed module.\n","authors":["Haojing Huang","Jingheng Ye","Qingyu Zhou","Yinghui Li","Yangning Li","Feng Zhou","Hai-Tao Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.09119v1.pdf","comment":"Accepted for publication in Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2308.16687v2","updated":"2023-10-13T13:56:28Z","published":"2023-08-31T12:43:18Z","title":"DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew","summary":" We present DictaBERT, a new state-of-the-art pre-trained BERT model for\nmodern Hebrew, outperforming existing models on most benchmarks. Additionally,\nwe release three fine-tuned versions of the model, designed to perform three\nspecific foundational tasks in the analysis of Hebrew texts: prefix\nsegmentation, morphological tagging and question answering. These fine-tuned\nmodels allow any developer to perform prefix segmentation, morphological\ntagging and question answering of a Hebrew input with a single call to a\nHuggingFace model, without the need to integrate any additional libraries or\ncode. In this paper we describe the details of the training as well as the\nresults on the different benchmarks. We release the models to the community,\nalong with sample code demonstrating their use. We release these models as part\nof our goal to help further research and development in Hebrew NLP.\n","authors":["Shaltiel Shmidman","Avi Shmidman","Moshe Koppel"],"pdf_url":"https://arxiv.org/pdf/2308.16687v2.pdf","comment":"Updated second version, with links to two question-answering models"},{"id":"http://arxiv.org/abs/2310.09107v1","updated":"2023-10-13T13:52:15Z","published":"2023-10-13T13:52:15Z","title":"GLoRE: Evaluating Logical Reasoning of Large Language Models","summary":" Recently, large language models (LLMs), including notable models such as\nGPT-4 and burgeoning community models, have showcased significant general\nlanguage understanding abilities. However, there has been a scarcity of\nattempts to assess the logical reasoning capacities of these LLMs, an essential\nfacet of natural language understanding. To encourage further investigation in\nthis area, we introduce GLoRE, a meticulously assembled General Logical\nReasoning Evaluation benchmark comprised of 12 datasets that span three\ndifferent types of tasks. Our experimental results show that compared to the\nperformance of human and supervised fine-tuning, the logical reasoning\ncapabilities of open LLMs necessitate additional improvement; ChatGPT and\nGPT-4 show a strong capability of logical reasoning, with GPT-4 surpassing\nChatGPT by a large margin. We propose a self-consistency probing method to\nenhance the accuracy of ChatGPT and a fine-tuned method to boost the\nperformance of an open LLM. We release the datasets and evaluation programs to\nfacilitate future research.\n","authors":["Hanmeng liu","Zhiyang Teng","Ruoxi Ning","Jian Liu","Qiji Zhou","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.09107v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06927v2","updated":"2023-10-13T13:47:44Z","published":"2023-10-10T18:28:38Z","title":"Sparse Fine-tuning for Inference Acceleration of Large Language Models","summary":" We consider the problem of accurate sparse fine-tuning of large language\nmodels (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks, while\ninducing sparsity in their weights. 
On the accuracy side, we observe that\nstandard loss-based fine-tuning may fail to recover accuracy, especially at\nhigh sparsities. To address this, we perform a detailed study of\ndistillation-type losses, determining an L2-based distillation approach we term\nSquareHead which enables accurate recovery even at higher sparsities, across\nall model types. On the practical efficiency side, we show that sparse LLMs can\nbe executed with speedups by taking advantage of sparsity, for both CPU and GPU\nruntimes. While the standard approach is to leverage sparsity for computational\nreduction, we observe that in the case of memory-bound LLMs sparsity can also\nbe leveraged for reducing memory bandwidth. We exhibit end-to-end results\nshowing speedups due to sparsity, while recovering accuracy, on T5 (language\ntranslation), Whisper (speech translation), and open GPT-type (MPT for text\ngeneration). For MPT text generation, we show for the first time that sparse\nfine-tuning can reach 75% sparsity without accuracy drops, provide notable\nend-to-end speedups for both CPU and GPU inference, and highlight that sparsity\nis also compatible with quantization approaches. Models and software for\nreproducing our results are provided in Section 6.\n","authors":["Eldar Kurtic","Denis Kuznedelev","Elias Frantar","Michael Goin","Dan Alistarh"],"pdf_url":"https://arxiv.org/pdf/2310.06927v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02071v2","updated":"2023-10-13T13:43:53Z","published":"2023-10-03T14:13:36Z","title":"Towards End-to-End Embodied Decision Making via Multi-modal Large\n Language Model: Explorations with GPT4-Vision and Beyond","summary":" In this study, we explore the potential of Multimodal Large Language Models\n(MLLMs) in improving embodied decision-making processes for agents. While Large\nLanguage Models (LLMs) have been widely used due to their advanced reasoning\nskills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual\nunderstanding and reasoning capabilities. We investigate whether\nstate-of-the-art MLLMs can handle embodied decision-making in an end-to-end\nmanner and whether collaborations between LLMs and MLLMs can enhance\ndecision-making. To address these questions, we introduce a new benchmark\ncalled PCA-EVAL, which evaluates embodied decision-making from the perspectives\nof Perception, Cognition, and Action. Additionally, we propose HOLMES, a\nmulti-agent cooperation framework that allows LLMs to leverage MLLMs and APIs\nto gather multimodal information for informed decision-making. We compare\nend-to-end embodied decision-making and HOLMES on our benchmark and find that\nthe GPT4-Vision model demonstrates strong end-to-end embodied decision-making\nabilities, outperforming GPT4-HOLMES in terms of average decision accuracy\n(+3%). However, this performance is exclusive to the latest GPT4-Vision model,\nsurpassing the open-source state-of-the-art MLLM by 26%. Our results indicate\nthat powerful MLLMs like GPT4-Vision hold promise for decision-making in\nembodied agents, offering new avenues for MLLM research. 
Code and data are open\nat https://github.com/pkunlp-icler/PCA-EVAL/.\n","authors":["Liang Chen","Yichi Zhang","Shuhuai Ren","Haozhe Zhao","Zefan Cai","Yuchi Wang","Peiyi Wang","Tianyu Liu","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2310.02071v2.pdf","comment":"18 pages, 10 figures, Code and data:\n https://github.com/pkunlp-icler/PCA-EVAL/"},{"id":"http://arxiv.org/abs/2211.03462v2","updated":"2023-10-13T13:20:51Z","published":"2022-11-07T11:25:21Z","title":"NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual\n Question Answering","summary":" Hybrid tabular-textual question answering (QA) requires reasoning from\nheterogeneous information, and the types of reasoning are mainly divided into\nnumerical reasoning and span extraction. Current numerical reasoning methods\nautoregressively decode program sequences, and each decoding step produces\neither an operator or an operand. However, the step-by-step decoding suffers\nfrom exposure bias, and the accuracy of program generation drops sharply as the\ndecoding steps unfold due to error propagation. In this paper, we propose a\nnon-autoregressive program generation framework, which independently generates\ncomplete program tuples containing both operators and operands, can address the\nerror propagation issue while significantly boosting the speed of program\ngeneration. Experiments on the ConvFinQA and MultiHiertt datasets show that our\nnon-autoregressive program generation method can bring about substantial\nimprovements over the strong FinQANet (+5.06 Exe Acc and +4.80 Prog Acc points)\nand MT2Net (+7.97 EM and +6.38 F1 points) baselines, establishing the new\nstate-of-the-art performance, while being much faster (21x) in program\ngeneration. Finally, with increasing numbers of numerical reasoning steps the\nperformance drop of our method is significantly smaller than that of the\nbaselines. Our code will be publicly available soon.\n","authors":["Tengxun Zhang","Hongfei Xu","Josef van Genabith","Deyi Xiong","Hongying Zan"],"pdf_url":"https://arxiv.org/pdf/2211.03462v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09089v1","updated":"2023-10-13T13:17:03Z","published":"2023-10-13T13:17:03Z","title":"Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large\n Language Model","summary":" Integrating large language models (LLMs) into healthcare presents potential\nbut faces challenges. Directly pre-training LLMs for domains like medicine is\nresource-heavy and sometimes unfeasible. Sole reliance on Supervised\nFine-tuning (SFT) can result in overconfident predictions and may not tap into\ndomain specific insights. Addressing these challenges, we present a multi-stage\ntraining method combining Domain-specific Continued Pre-training (DCPT), SFT,\nand Direct Preference Optimization (DPO). A notable contribution of our study\nis the introduction of a 3Gb Chinese Medicine (ChiMed) dataset, encompassing\nmedical question answering, plain texts, knowledge graphs, and dialogues,\nsegmented into three training stages. The medical LLM trained with our\npipeline, Qilin-Med, exhibits significant performance boosts. In the CPT and\nSFT phases, it achieves 38.4% and 40.0% accuracy on the CMExam, surpassing\nBaichuan-7B's 33.5%. 
In the DPO phase, on the Huatuo-26M test set, it scores\n16.66 in BLEU-1 and 27.44 in ROUGE1, outperforming the SFT's 12.69 and 24.21.\nThis highlights the strength of our training approach in refining LLMs for\nmedical applications.\n","authors":["Qichen Ye","Junling Liu","Dading Chong","Peilin Zhou","Yining Hua","Andrew Liu"],"pdf_url":"https://arxiv.org/pdf/2310.09089v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09088v1","updated":"2023-10-13T13:16:57Z","published":"2023-10-13T13:16:57Z","title":"Dialect Transfer for Swiss German Speech Translation","summary":" This paper investigates the challenges in building Swiss German speech\ntranslation systems, specifically focusing on the impact of dialect diversity\nand differences between Swiss German and Standard German. Swiss German is a\nspoken language with no formal writing system; it comprises many diverse\ndialects and is a low-resource language with only around 5 million speakers.\nThe study is guided by two key research questions: how does the inclusion and\nexclusion of dialects during the training of speech translation models for\nSwiss German impact the performance on specific dialects, and how do the\ndifferences between Swiss German and Standard German impact the performance of\nthe systems? We show that dialect diversity and linguistic differences pose\nsignificant challenges to Swiss German speech translation, which is in line\nwith linguistic hypotheses derived from empirical investigations.\n","authors":["Claudio Paonessa","Yanick Schraner","Jan Deriu","Manuela Hürlimann","Manfred Vogel","Mark Cieliebak"],"pdf_url":"https://arxiv.org/pdf/2310.09088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.13619v2","updated":"2023-10-13T12:13:21Z","published":"2023-02-27T09:40:41Z","title":"Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading\n Comprehension","summary":" The conversational machine reading comprehension (CMRC) task aims to answer\nquestions in conversations, which has been a hot research topic in recent years\nbecause of its wide applications. However, existing CMRC benchmarks in which\neach conversation is assigned a static passage are inconsistent with real\nscenarios. Thus, a model's comprehension ability in real scenarios is hard\nto evaluate reasonably. To this end, we propose the first Chinese CMRC\nbenchmark Orca and further provide zero-shot/few-shot settings to evaluate\na model's generalization ability across diverse domains. We collect 831\nhot-topic driven conversations with 4,742 turns in total. Each turn of a\nconversation is assigned a response-related passage, aiming to evaluate\nthe model's comprehension ability more reasonably. The topics of conversations are\ncollected from social media platforms and cover 33 domains, trying to be\nconsistent with real scenarios. Importantly, answers in Orca are all\nwell-annotated natural responses rather than the specific spans or short phrases\nin previous datasets. Besides, we implement three strong baselines to tackle\nthe challenge in Orca. The results indicate the great challenge of our CMRC\nbenchmark. 
Our dataset and checkpoints are available at\nhttps://github.com/nuochenpku/Orca.\n","authors":["Nuo Chen","Hongguang Li","Junqing He","Yinan Bao","Xinshi Lin","Qi Yang","Jianfeng Liu","Ruyi Gan","Jiaxing Zhang","Baoyuan Wang","Jia Li"],"pdf_url":"https://arxiv.org/pdf/2302.13619v2.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2310.09044v1","updated":"2023-10-13T12:12:34Z","published":"2023-10-13T12:12:34Z","title":"KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level\n Hallucination Detection","summary":" Large Language Models (LLMs) have demonstrated remarkable human-level natural\nlanguage generation capabilities. However, their potential to generate\nmisinformation, often called the hallucination problem, poses a significant\nrisk to their deployment. A common approach to address this issue is to\nretrieve relevant knowledge and fine-tune the LLM with the knowledge in its\ninput. Unfortunately, this method incurs high training costs and may cause\ncatastrophic forgetting for multi-tasking models. To overcome these\nlimitations, we propose a knowledge-constrained decoding method called KCTS\n(Knowledge-Constrained Tree Search), which guides a frozen LM to generate text\naligned with the reference knowledge at each decoding step using a knowledge\nclassifier score and MCTS (Monte-Carlo Tree Search). To adapt the\nsequence-level knowledge classifier to token-level guidance, we also propose a\nnovel token-level hallucination detection method called RIPA (Reward Inflection\nPoint Approximation). Our empirical results on knowledge-grounded dialogue and\nabstractive summarization demonstrate the strength of KCTS as a plug-and-play,\nmodel-agnostic decoding method that can effectively reduce hallucinations in\nnatural language generation.\n","authors":["Sehyun Choi","Tianqing Fang","Zhaowei Wang","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2310.09044v1.pdf","comment":"Accepted at EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.02778v2","updated":"2023-10-13T12:10:01Z","published":"2023-10-04T12:50:26Z","title":"Integrating UMLS Knowledge into Large Language Models for Medical\n Question Answering","summary":" Large language models (LLMs) have demonstrated powerful text generation\ncapabilities, bringing unprecedented innovation to the healthcare field. While\nLLMs hold immense promise for applications in healthcare, applying them to real\nclinical scenarios presents significant challenges, as these models may\ngenerate content that deviates from established medical facts and even exhibit\npotential biases. In our research, we develop an augmented LLM framework based\non the Unified Medical Language System (UMLS), aiming to better serve the\nhealthcare community. We employ LLaMa2-13b-chat and ChatGPT-3.5 as our\nbenchmark models, and conduct automatic evaluations using the ROUGE Score and\nBERTScore on 104 questions from the LiveQA test set. Additionally, we establish\ncriteria for physician-evaluation based on four dimensions: Factuality,\nCompleteness, Readability and Relevancy. ChatGPT-3.5 is used for physician\nevaluation with 20 questions on the LiveQA test set. Multiple resident\nphysicians conducted blind reviews to evaluate the generated content, and the\nresults indicate that this framework effectively enhances the factuality,\ncompleteness, and relevance of generated content. 
Our research demonstrates the\neffectiveness of using UMLS-augmented LLMs and highlights the potential\napplication value of LLMs in medical question-answering.\n","authors":["Rui Yang","Edison Marrese-Taylor","Yuhe Ke","Lechao Cheng","Qingyu Chen","Irene Li"],"pdf_url":"https://arxiv.org/pdf/2310.02778v2.pdf","comment":"12 pages, 3 figures"},{"id":"http://arxiv.org/abs/2210.03029v3","updated":"2023-10-13T11:58:30Z","published":"2022-10-06T16:26:03Z","title":"Efficiently Enhancing Zero-Shot Performance of Instruction Following\n Model via Retrieval of Soft Prompt","summary":" Enhancing the zero-shot performance of instruction-following models requires\nheavy computation, either by scaling the total number of training datasets or\nthe model size. In this work, we explore how retrieval of soft prompts obtained\nthrough prompt tuning can efficiently assist hard prompts in zero-shot task\ngeneralization. Specifically, we train soft prompt embeddings for each prompt\nthrough prompt tuning, store the samples of the training instances mapped with\nthe prompt embeddings, and retrieve the corresponding prompt embedding of the\ntraining instance closest to the query instance during inference. While only\nadding 0.007% additional parameters, retrieval of soft prompt enhances the\nperformance of T0 on unseen tasks by outperforming it on 10 out of 11 datasets\nas well as improving the mean accuracy of T0 on the BIG-bench benchmark by 2.39%\npoints. Also, we report an interesting finding that retrieving source\nembeddings trained on similar answer choice formats is more important than\nthose on similar task types.\n","authors":["Seonghyeon Ye","Joel Jang","Doyoung Kim","Yongrae Jo","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2210.03029v3.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.09036v1","updated":"2023-10-13T11:57:04Z","published":"2023-10-13T11:57:04Z","title":"MM-BigBench: Evaluating Multimodal Models on Multimodal Content\n Comprehension Tasks","summary":" The popularity of multimodal large language models (MLLMs) has triggered a\nrecent surge in research efforts dedicated to evaluating these models.\nNevertheless, existing evaluation studies of MLLMs primarily focus on the\ncomprehension and reasoning of unimodal (vision) content, neglecting\nperformance evaluations in the domain of multimodal (vision-language) content\nunderstanding. Beyond multimodal reasoning, tasks related to multimodal content\ncomprehension necessitate a profound understanding of multimodal contexts,\nachieved through the multimodal interaction to obtain a final answer. In this\npaper, we introduce a comprehensive assessment framework called MM-BigBench,\nwhich incorporates a diverse range of metrics to offer an extensive evaluation\nof the performance of various models and instructions across a wide spectrum of\ndiverse multimodal content comprehension tasks. Consequently, our work\ncomplements research on the performance of MLLMs in multimodal comprehension\ntasks, achieving a more comprehensive and holistic evaluation of MLLMs. To\nbegin, we employ the Best Performance metric to ascertain each model's\nperformance upper bound on different datasets. 
Subsequently, the Mean Relative\nGain metric offers an assessment of the overall performance of various models\nand instructions, while the Stability metric measures their sensitivity.\nFurthermore, previous research centers on evaluating models independently or\nsolely assessing instructions, neglecting the adaptability between models and\ninstructions. We propose the Adaptability metric to quantify the adaptability\nbetween models and instructions. Our paper evaluates a total of 20 language\nmodels (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10\ninstructions for each task, and derives novel insights. Our code will be\nreleased at https://github.com/declare-lab/MM-BigBench.\n","authors":["Xiaocui Yang","Wenfang Wu","Shi Feng","Ming Wang","Daling Wang","Yang Li","Qi Sun","Yifei Zhang","Xiaoming Fu","Soujanya Poria"],"pdf_url":"https://arxiv.org/pdf/2310.09036v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.09017v1","updated":"2023-10-13T11:28:02Z","published":"2023-10-13T11:28:02Z","title":"Don't Add, don't Miss: Effective Content Preserving Generation from\n Pre-Selected Text Spans","summary":" The recently introduced Controlled Text Reduction (CTR) task isolates the\ntext generation step within typical summarization-style tasks. It does so by\nchallenging models to generate coherent text conforming to pre-selected content\nwithin the input text (\"highlights\").\n This framing enables increased modularity in summarization-like tasks,\nallowing one to couple a single CTR model with various content-selection setups and\nmodules.\n However, there are currently no reliable CTR models, while the performance of\nthe existing baseline for the task is mediocre, falling short of practical\nutility.\n Here, we address this gap by introducing a high-quality, open-source CTR\nmodel that tackles two prior key limitations: inadequate enforcement of the\ncontent-preservation constraint, and suboptimal silver training data.\n Addressing these, we amplify the content-preservation constraint in both\ntraining, via RL, and inference, via a controlled decoding strategy.\n Further, we substantially improve the silver training data quality via GPT-4\ndistillation.\n Overall, pairing the distilled dataset with the highlight-adherence\nstrategies yields marked gains over the current baseline, of up to 30 ROUGE-L\npoints, providing a reliable CTR model for downstream use.\n","authors":["Aviv Slobodkin","Avi Caciularu","Eran Hirsch","Ido Dagan"],"pdf_url":"https://arxiv.org/pdf/2310.09017v1.pdf","comment":"EMNLP 2023, findings"},{"id":"http://arxiv.org/abs/2305.13066v2","updated":"2023-10-13T11:19:19Z","published":"2023-05-22T14:36:32Z","title":"Biomedical Named Entity Recognition via Dictionary-based Synonym\n Generalization","summary":" Biomedical named entity recognition is one of the core tasks in biomedical\nnatural language processing (BioNLP). To tackle this task, numerous\nsupervised/distantly supervised approaches have been proposed. Despite their\nremarkable success, these approaches inescapably demand laborious human effort.\nTo alleviate the need for human effort, dictionary-based approaches have been\nproposed to extract named entities simply based on a given dictionary. However,\none downside of existing dictionary-based approaches is that they are\nchallenged to identify concept synonyms that are not listed in the given\ndictionary, which we refer to as the synonym generalization problem. 
In this\nstudy, we propose a novel Synonym Generalization (SynGen) framework that\nrecognizes the biomedical concepts contained in the input text using span-based\npredictions. In particular, SynGen introduces two regularization terms, namely,\n(1) a synonym distance regularizer; and (2) a noise perturbation regularizer,\nto minimize the synonym generalization error. To demonstrate the effectiveness\nof our approach, we provide a theoretical analysis of the bound of synonym\ngeneralization error. We extensively evaluate our approach on a wide range of\nbenchmarks and the results verify that SynGen outperforms previous\ndictionary-based models by notable margins. Lastly, we provide a detailed\nanalysis to further reveal the merits and inner-workings of our approach.\n","authors":["Zihao Fu","Yixuan Su","Zaiqiao Meng","Nigel Collier"],"pdf_url":"https://arxiv.org/pdf/2305.13066v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07397v2","updated":"2023-10-13T11:16:58Z","published":"2023-10-11T11:32:57Z","title":"Target-oriented Proactive Dialogue Systems with Personalization: Problem\n Formulation and Dataset Curation","summary":" Target-oriented dialogue systems, designed to proactively steer conversations\ntoward predefined targets or accomplish specific system-side goals, are an\nexciting area in conversational AI. In this work, by formulating a pair as the conversation target, we explore a novel problem of\npersonalized target-oriented dialogue by considering personalization during the\ntarget accomplishment process. However, there remains an emergent need for\nhigh-quality datasets, and building one from scratch requires tremendous human\neffort. To address this, we propose an automatic dataset curation framework\nusing a role-playing approach. Based on this framework, we construct a\nlarge-scale personalized target-oriented dialogue dataset, TopDial, which\ncomprises about 18K multi-turn dialogues. The experimental results show that\nthis dataset is of high quality and could contribute to exploring personalized\ntarget-oriented dialogue.\n","authors":["Jian Wang","Yi Cheng","Dongding Lin","Chak Tou Leong","Wenjie Li"],"pdf_url":"https://arxiv.org/pdf/2310.07397v2.pdf","comment":"Accepted to EMNLP-2023 main conference"},{"id":"http://arxiv.org/abs/2305.13026v2","updated":"2023-10-13T10:43:05Z","published":"2023-05-22T13:27:37Z","title":"DUMB: A Benchmark for Smart Evaluation of Dutch Models","summary":" We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a\ndiverse set of datasets for low-, medium- and high-resource tasks. The total\nset of nine tasks includes four tasks that were previously not available in\nDutch. Instead of relying on a mean score across tasks, we propose Relative\nError Reduction (RER), which compares the DUMB performance of language models\nto a strong baseline which can be referred to in the future even when assessing\ndifferent sets of language models. Through a comparison of 14 pre-trained\nlanguage models (mono- and multi-lingual, of varying sizes), we assess the\ninternal consistency of the benchmark tasks, as well as the factors that likely\nenable high performance. Our results indicate that current Dutch monolingual\nmodels under-perform and suggest training larger Dutch models with other\narchitectures and pre-training objectives. At present, the highest performance\nis achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base). 
In\naddition to highlighting best strategies for training larger Dutch models, DUMB\nwill foster further research on Dutch. A public leaderboard is available at\nhttps://dumbench.nl.\n","authors":["Wietse de Vries","Martijn Wieling","Malvina Nissim"],"pdf_url":"https://arxiv.org/pdf/2305.13026v2.pdf","comment":"EMNLP 2023 camera-ready"},{"id":"http://arxiv.org/abs/2310.08992v1","updated":"2023-10-13T10:17:48Z","published":"2023-10-13T10:17:48Z","title":"CodeChain: Towards Modular Code Generation Through Chain of\n Self-revisions with Representative Sub-modules","summary":" Large Language Models (LLMs) have already become quite proficient at solving\nsimpler programming tasks like those in HumanEval or MBPP benchmarks. However,\nsolving more complex and competitive programming tasks is still quite\nchallenging for these models - possibly due to their tendency to generate\nsolutions as monolithic code blocks instead of decomposing them into logical\nsub-tasks and sub-modules. On the other hand, experienced programmers\ninstinctively write modularized code with abstraction for solving complex\ntasks, often reusing previously developed modules. To address this gap, we\npropose CodeChain, a novel framework for inference that elicits modularized\ncode generation through a chain of self-revisions, each being guided by some\nrepresentative sub-modules generated in previous iterations. Concretely,\nCodeChain first instructs the LLM to generate modularized codes through\nchain-of-thought prompting. Then it applies a chain of self-revisions by\niterating the two steps: 1) extracting and clustering the generated sub-modules\nand selecting the cluster representatives as the more generic and re-usable\nimplementations, and 2) augmenting the original chain-of-thought prompt with\nthese selected module-implementations and instructing the LLM to re-generate\nnew modularized solutions. We find that by naturally encouraging the LLM to\nreuse the previously developed and verified sub-modules, CodeChain can\nsignificantly boost both modularity as well as correctness of the generated\nsolutions, achieving relative pass@1 improvements of 35% on APPS and 76% on\nCodeContests. It is shown to be effective on both OpenAI LLMs as well as\nopen-sourced LLMs like WizardCoder. We also conduct comprehensive ablation\nstudies with different methods of prompting, number of clusters, model sizes,\nprogram qualities, etc., to provide useful insights that underpin CodeChain's\nsuccess.\n","authors":["Hung Le","Hailin Chen","Amrita Saha","Akash Gokul","Doyen Sahoo","Shafiq Joty"],"pdf_url":"https://arxiv.org/pdf/2310.08992v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05364v3","updated":"2023-10-13T09:47:08Z","published":"2023-10-09T02:50:54Z","title":"Universal Multi-modal Entity Alignment via Iteratively Fusing Modality\n Similarity Paths","summary":" The objective of Entity Alignment (EA) is to identify equivalent entity pairs\nfrom multiple Knowledge Graphs (KGs) and create a more comprehensive and\nunified KG. The majority of EA methods have primarily focused on the structural\nmodality of KGs, lacking exploration of multi-modal information. A few\nmulti-modal EA methods have made good attempts in this field. Still, they have\ntwo shortcomings: (1) inconsistent and inefficient modality modeling that\ndesigns complex and distinct models for each modality; (2) ineffective modality\nfusion due to the heterogeneous nature of modalities in EA. 
To tackle these\nchallenges, we propose PathFusion, consisting of two main components: (1) MSP,\na unified modeling approach that simplifies the alignment process by\nconstructing paths connecting entities and modality nodes to represent multiple\nmodalities; (2) IRF, an iterative fusion method that effectively combines\ninformation from different modalities using the path as an information carrier.\nExperimental results on real-world datasets demonstrate the superiority of\nPathFusion over state-of-the-art methods, with 22.4%-28.9% absolute improvement\non Hits@1, and 0.194-0.245 absolute improvement on MRR.\n","authors":["Bolin Zhu","Xiaoze Liu","Xin Mao","Zhuo Chen","Lingbing Guo","Tao Gui","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.05364v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08975v1","updated":"2023-10-13T09:45:14Z","published":"2023-10-13T09:45:14Z","title":"ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question\n Answering with Fine-tuned Large Language Models","summary":" Knowledge Base Question Answering (KBQA) aims to derive answers to natural\nlanguage questions over large-scale knowledge bases (KBs), which are generally\ndivided into two research components: knowledge retrieval and semantic parsing.\nHowever, three core challenges remain, including inefficient knowledge\nretrieval, retrieval errors adversely affecting semantic parsing, and the\ncomplexity of previous KBQA methods. In the era of large language models\n(LLMs), we introduce ChatKBQA, a novel generate-then-retrieve KBQA framework\nbuilt on fine-tuning open-source LLMs such as Llama-2, ChatGLM2 and Baichuan2.\nChatKBQA proposes generating the logical form with fine-tuned LLMs first, then\nretrieving and replacing entities and relations through an unsupervised\nretrieval method, which improves both generation and retrieval more\nstraightforwardly. Experimental results reveal that ChatKBQA achieves new\nstate-of-the-art performance on standard KBQA datasets, WebQSP, and\nComplexWebQuestions (CWQ). This work also provides a new paradigm for combining\nLLMs with knowledge graphs (KGs) for interpretable and knowledge-required\nquestion answering. Our code is publicly available.\n","authors":["Haoran Luo","Haihong E","Zichen Tang","Shiyao Peng","Yikai Guo","Wentai Zhang","Chenghao Ma","Guanting Dong","Meina Song","Wei Lin"],"pdf_url":"https://arxiv.org/pdf/2310.08975v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2310.08967v1","updated":"2023-10-13T09:18:57Z","published":"2023-10-13T09:18:57Z","title":"Towards Example-Based NMT with Multi-Levenshtein Transformers","summary":" Retrieval-Augmented Machine Translation (RAMT) is attracting growing\nattention. This is because RAMT not only improves translation metrics, but is\nalso assumed to implement some form of domain adaptation. In this contribution,\nwe study another salient trait of RAMT, its ability to make translation\ndecisions more transparent by allowing users to go back to examples that\ncontributed to these decisions.\n For this, we propose a novel architecture aiming to increase this\ntransparency. This model adapts a retrieval-augmented version of the\nLevenshtein Transformer and makes it amenable to simultaneously edit multiple\nfuzzy matches found in memory. 
We discuss how to perform training and inference\nin this model, based on multi-way alignment algorithms and imitation learning.\nOur experiments show that editing several examples positively impacts\ntranslation scores, notably increasing the number of target spans that are\ncopied from existing instances.\n","authors":["Maxime Bouthors","Josep Crego","François Yvon"],"pdf_url":"https://arxiv.org/pdf/2310.08967v1.pdf","comment":"17 pages, EMNLP 2023 submission"},{"id":"http://arxiv.org/abs/2310.08958v1","updated":"2023-10-13T09:07:13Z","published":"2023-10-13T09:07:13Z","title":"xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark","summary":" Recent advancements in reference-free learned metrics for open-domain\ndialogue evaluation have been driven by the progress in pre-trained language\nmodels and the availability of dialogue data with high-quality human\nannotations. However, current studies predominantly concentrate on English\ndialogues, and the generalization of these metrics to other languages has not\nbeen fully examined. This is largely due to the absence of a multilingual\ndialogue evaluation benchmark. To address the issue, we introduce xDial-Eval,\nbuilt on top of open-source English dialogue evaluation datasets. xDial-Eval\nincludes 12 turn-level and 6 dialogue-level English datasets, comprising 14930\nannotated turns and 8691 annotated dialogues respectively. The English dialogue\ndata are extended to nine other languages with commercial machine translation\nsystems. On xDial-Eval, we conduct comprehensive analyses of previous\nBERT-based metrics and the recently-emerged large language models. Lastly, we\nestablish strong self-supervised and multilingual baselines. In terms of\naverage Pearson correlations over all datasets and languages, the best baseline\noutperforms OpenAI's ChatGPT by absolute improvements of 6.5% and 4.6% at the\nturn and dialogue levels respectively, albeit with much fewer parameters. The\ndata and code are publicly available at https://github.com/e0397123/xDial-Eval.\n","authors":["Chen Zhang","Luis Fernando D'Haro","Chengguang Tang","Ke Shi","Guohua Tang","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2310.08958v1.pdf","comment":"Accepted to EMNLP-2023 Findings"},{"id":"http://arxiv.org/abs/2305.02615v2","updated":"2023-10-13T09:02:23Z","published":"2023-05-04T07:45:49Z","title":"How to Enhance Causal Discrimination of Utterances: A Case on Affective\n Reasoning","summary":" Our investigation into the Affective Reasoning in Conversation (ARC) task\nhighlights the challenge of causal discrimination. Almost all existing models,\nincluding large language models (LLMs), excel at capturing semantic\ncorrelations within utterance embeddings but fall short in determining the\nspecific causal relationships. To overcome this limitation, we propose the\nincorporation of \\textit{i.i.d.} noise terms into the conversation process,\nthereby constructing a structural causal model (SCM). It explores how distinct\ncausal relationships of fitted embeddings can be discerned through independent\nconditions. To facilitate the implementation of deep learning, we introduce the\ncogn frameworks to handle unstructured conversation data, and employ an\nautoencoder architecture to regard the unobservable noise as learnable\n\"implicit causes.\" Moreover, we curate a synthetic dataset that includes i.i.d.\nnoise. Through comprehensive experiments, we validate the effectiveness and\ninterpretability of our approach. 
Our code is available at\nhttps://github.com/Zodiark-ch/mater-of-our-EMNLP2023-paper.\n","authors":["Hang Chen","Jing Luo","Xinyu Yang","Wenjing Zhu"],"pdf_url":"https://arxiv.org/pdf/2305.02615v2.pdf","comment":"accepted via EMNLP2023-main"},{"id":"http://arxiv.org/abs/2310.08954v1","updated":"2023-10-13T08:55:19Z","published":"2023-10-13T08:55:19Z","title":"Textual Analysis of ICALEPCS and IPAC Conference Proceedings: Revealing\n Research Trends, Topics, and Collaborations for Future Insights and Advanced\n Search","summary":" In this paper, we present a textual analysis of past ICALEPCS and IPAC\nconference proceedings to gain insights into the research trends and topics\ndiscussed in the field. We use natural language processing techniques to\nextract meaningful information from the abstracts and papers of past conference\nproceedings. We extract topics to visualize and identify trends, analyze their\nevolution to identify emerging research directions, and highlight interesting\npublications based solely on their content with an analysis of their network.\nAdditionally, we will provide an advanced search tool to better search the\nexisting papers, prevent duplication, and make reference finding easier. Our\nanalysis provides a comprehensive overview of the research landscape in the\nfield and helps researchers and practitioners to better understand the\nstate-of-the-art and identify areas for future research.\n","authors":["Antonin Sulc","Annika Eichler","Tim Wilksen"],"pdf_url":"https://arxiv.org/pdf/2310.08954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08949v1","updated":"2023-10-13T08:38:56Z","published":"2023-10-13T08:38:56Z","title":"Making Multimodal Generation Easier: When Diffusion Models Meet LLMs","summary":" We present EasyGen, an efficient model designed to enhance multimodal\nunderstanding and generation by harnessing the capabilities of diffusion models\nand large language models (LLMs). Unlike existing multimodal models that\npredominantly depend on encoders like CLIP or ImageBind and need ample amounts\nof training data to bridge the gap between modalities, EasyGen is built upon a\nbidirectional conditional diffusion model named BiDiffuser, which promotes more\nefficient interactions between modalities. EasyGen handles image-to-text\ngeneration by integrating BiDiffuser and an LLM via a simple projection layer.\nUnlike most existing multimodal models that are limited to generating text\nresponses, EasyGen can also facilitate text-to-image generation by leveraging\nthe LLM to create textual descriptions, which can be interpreted by BiDiffuser\nto generate appropriate visual responses. Extensive quantitative and\nqualitative experiments demonstrate the effectiveness of EasyGen, whose\ntraining can be easily achieved in a lab setting. The source code is available\nat https://github.com/zxy556677/EasyGen.\n","authors":["Xiangyu Zhao","Bo Liu","Qijiong Liu","Guangyuan Shi","Xiao-Ming Wu"],"pdf_url":"https://arxiv.org/pdf/2310.08949v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08944v1","updated":"2023-10-13T08:19:31Z","published":"2023-10-13T08:19:31Z","title":"CAMELL: Confidence-based Acquisition Model for Efficient Self-supervised\n Active Learning with Label Validation","summary":" Supervised neural approaches are hindered by their dependence on large,\nmeticulously annotated datasets, a requirement that is particularly cumbersome\nfor sequential tasks. 
The quality of annotations tends to deteriorate with the\ntransition from expert-based to crowd-sourced labelling. To address these\nchallenges, we present \\textbf{CAMELL} (Confidence-based Acquisition Model for\nEfficient self-supervised active Learning with Label validation), a pool-based\nactive learning framework tailored for sequential multi-output problems. CAMELL\npossesses three core features: (1) it requires expert annotators to label only\na fraction of a chosen sequence, (2) it facilitates self-supervision for the\nremainder of the sequence, and (3) it employs a label validation mechanism to\nprevent erroneous labels from contaminating the dataset and harming model\nperformance. We evaluate CAMELL on sequential tasks, with a special emphasis on\ndialogue belief tracking, a task plagued by the constraints of limited and\nnoisy datasets. Our experiments demonstrate that CAMELL outperforms the\nbaselines in terms of efficiency. Furthermore, the data corrections suggested\nby our method contribute to an overall improvement in the quality of the\nresulting datasets.\n","authors":["Carel van Niekerk","Christian Geishauser","Michael Heck","Shutong Feng","Hsien-chin Lin","Nurul Lubis","Benjamin Ruppik","Renato Vukovic","Milica Gašić"],"pdf_url":"https://arxiv.org/pdf/2310.08944v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08943v1","updated":"2023-10-13T08:16:27Z","published":"2023-10-13T08:16:27Z","title":"Multi-level Adaptive Contrastive Learning for Knowledge Internalization\n in Dialogue Generation","summary":" Knowledge-grounded dialogue generation aims to mitigate the issue of text\ndegeneration by incorporating external knowledge to supplement the context.\nHowever, the model often fails to internalize this information into responses\nin a human-like manner. Instead, it simply inserts segments of the provided\nknowledge into generic responses. As a result, the generated responses tend to\nbe tedious, incoherent, and in lack of interactivity which means the\ndegeneration problem is still unsolved. In this work, we first find that such\ncopying-style degeneration is primarily due to the weak likelihood objective,\nwhich allows the model to \"cheat\" the objective by merely duplicating knowledge\nsegments in a superficial pattern matching based on overlap. To overcome this\nchallenge, we then propose a Multi-level Adaptive Contrastive Learning (MACL)\nframework that dynamically samples negative examples and subsequently penalizes\ndegeneration behaviors at both the token-level and sequence-level. Extensive\nexperiments on the WoW dataset demonstrate the effectiveness of our approach\nacross various pre-trained models.\n","authors":["Chenxu Yang","Zheng Lin","Lanrui Wang","Chong Tian","Liang Pang","Jiangnan Li","Yanan Cao","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08943v1.pdf","comment":"EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2310.08923v1","updated":"2023-10-13T07:49:11Z","published":"2023-10-13T07:49:11Z","title":"Towards Informative Few-Shot Prompt with Maximum Information Gain for\n In-Context Learning","summary":" Large Language models (LLMs) possess the capability to engage In-context\nLearning (ICL) by leveraging a few demonstrations pertaining to a new\ndownstream task as conditions. However, this particular learning paradigm\nsuffers from high instability stemming from substantial variances induced by\nfactors such as the input distribution of selected examples, their ordering,\nand prompt formats. 
In this work, we demonstrate that even when all these\nfactors are held constant, the random selection of examples still results in\nhigh variance. Consequently, we aim to explore the informative ability of data\nexamples by quantifying the Information Gain (IG) obtained in prediction after\nobserving a given example candidate. Then we propose to sample those with\nmaximum IG. Additionally, we identify the presence of template bias, which can\nlead to unfair evaluations of IG during the sampling process. To mitigate this\nbias, we introduce Calibration Before Sampling strategy. The experimental\nresults illustrate that our proposed method can yield an average relative\nimprovement of 14.3% across six classification tasks using three LLMs.\n","authors":["Hongfu Liu","Ye Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08923v1.pdf","comment":"Accepted to the Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08917v1","updated":"2023-10-13T07:40:12Z","published":"2023-10-13T07:40:12Z","title":"Relation-aware Ensemble Learning for Knowledge Graph Embedding","summary":" Knowledge graph (KG) embedding is a fundamental task in natural language\nprocessing, and various methods have been proposed to explore semantic patterns\nin distinctive ways. In this paper, we propose to learn an ensemble by\nleveraging existing methods in a relation-aware manner. However, exploring\nthese semantics using relation-aware ensemble leads to a much larger search\nspace than general ensemble methods. To address this issue, we propose a\ndivide-search-combine algorithm RelEns-DSC that searches the relation-wise\nensemble weights independently. This algorithm has the same computation cost as\ngeneral ensemble methods but with much better performance. Experimental results\non benchmark datasets demonstrate the effectiveness of the proposed method in\nefficiently searching relation-aware ensemble weights and achieving\nstate-of-the-art embedding performance. The code is public at\nhttps://github.com/LARS-research/RelEns.\n","authors":["Ling Yue","Yongqi Zhang","Quanming Yao","Yong Li","Xian Wu","Ziheng Zhang","Zhenxi Lin","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.08917v1.pdf","comment":"This short paper has been accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08908v1","updated":"2023-10-13T07:30:27Z","published":"2023-10-13T07:30:27Z","title":"Human-in-the-loop Machine Translation with Large Language Model","summary":" The large language model (LLM) has garnered significant attention due to its\nin-context learning mechanisms and emergent capabilities. The research\ncommunity has conducted several pilot studies to apply LLMs to machine\ntranslation tasks and evaluate their performance from diverse perspectives.\nHowever, previous research has primarily focused on the LLM itself and has not\nexplored human intervention in the inference process of LLM. The\ncharacteristics of LLM, such as in-context learning and prompt engineering,\nclosely mirror human cognitive abilities in language tasks, offering an\nintuitive solution for human-in-the-loop generation. In this study, we propose\na human-in-the-loop pipeline that guides LLMs to produce customized outputs\nwith revision instructions. The pipeline initiates by prompting the LLM to\nproduce a draft translation, followed by the utilization of automatic retrieval\nor human feedback as supervision signals to enhance the LLM's translation\nthrough in-context learning. 
The human-machine interactions generated in this\npipeline are also stored in an external database to expand the in-context\nretrieval database, enabling us to leverage human supervision in an offline\nsetting. We evaluate the proposed pipeline using GPT-3.5-turbo API on five\ndomain-specific benchmarks for German-English translation. The results\ndemonstrate the effectiveness of the pipeline in tailoring in-domain\ntranslations and improving translation performance compared to direct\ntranslation. Additionally, we discuss the results from the following\nperspectives: 1) the effectiveness of different in-context retrieval methods;\n2) the construction of a retrieval database under low-resource scenarios; 3)\nthe observed domains differences; 4) the quantitative analysis of linguistic\nstatistics; and 5) the qualitative analysis of translation cases. The code and\ndata are available at https://github.com/NLP2CT/HIL-MT/.\n","authors":["Xinyi Yang","Runzhe Zhan","Derek F. Wong","Junchao Wu","Lidia S. Chao"],"pdf_url":"https://arxiv.org/pdf/2310.08908v1.pdf","comment":"Accepted to MT Summit 2023"},{"id":"http://arxiv.org/abs/2310.08903v1","updated":"2023-10-13T07:18:53Z","published":"2023-10-13T07:18:53Z","title":"SeqXGPT: Sentence-Level AI-Generated Text Detection","summary":" Widely applied large language models (LLMs) can generate human-like content,\nraising concerns about the abuse of LLMs. Therefore, it is important to build\nstrong AI-generated text (AIGT) detectors. Current works only consider\ndocument-level AIGT detection, therefore, in this paper, we first introduce a\nsentence-level detection challenge by synthesizing a dataset that contains\ndocuments that are polished with LLMs, that is, the documents contain sentences\nwritten by humans and sentences modified by LLMs. Then we propose\n\\textbf{Seq}uence \\textbf{X} (Check) \\textbf{GPT}, a novel method that utilizes\nlog probability lists from white-box LLMs as features for sentence-level AIGT\ndetection. These features are composed like \\textit{waves} in speech processing\nand cannot be studied by LLMs. Therefore, we build SeqXGPT based on convolution\nand self-attention networks. We test it in both sentence and document-level\ndetection challenges. Experimental results show that previous methods struggle\nin solving sentence-level AIGT detection, while our method not only\nsignificantly surpasses baseline methods in both sentence and document-level\ndetection challenges but also exhibits strong generalization capabilities.\n","authors":["Pengyu Wang","Linyang Li","Ke Ren","Botian Jiang","Dong Zhang","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2310.08903v1.pdf","comment":"Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2310.08901v1","updated":"2023-10-13T07:15:32Z","published":"2023-10-13T07:15:32Z","title":"Welfare Diplomacy: Benchmarking Language Model Cooperation","summary":" The growing capabilities and increasingly widespread deployment of AI systems\nnecessitate robust benchmarks for measuring their cooperative capabilities.\nUnfortunately, most multi-agent benchmarks are either zero-sum or purely\ncooperative, providing limited opportunities for such measurements. We\nintroduce a general-sum variant of the zero-sum board game Diplomacy -- called\nWelfare Diplomacy -- in which players must balance investing in military\nconquest and domestic welfare. We argue that Welfare Diplomacy facilitates both\na clearer assessment of and stronger training incentives for cooperative\ncapabilities. 
Our contributions are: (1) proposing the Welfare Diplomacy rules\nand implementing them via an open-source Diplomacy engine; (2) constructing\nbaseline agents using zero-shot prompted language models; and (3) conducting\nexperiments where we find that baselines using state-of-the-art models attain\nhigh social welfare but are exploitable. Our work aims to promote societal\nsafety by aiding researchers in developing and assessing multi-agent AI\nsystems. Code to evaluate Welfare Diplomacy and reproduce our experiments is\navailable at https://github.com/mukobi/welfare-diplomacy.\n","authors":["Gabriel Mukobi","Hannah Erlebach","Niklas Lauffer","Lewis Hammond","Alan Chan","Jesse Clifton"],"pdf_url":"https://arxiv.org/pdf/2310.08901v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08899v1","updated":"2023-10-13T07:03:39Z","published":"2023-10-13T07:03:39Z","title":"Exploration with Principles for Diverse AI Supervision","summary":" Training large transformers using next-token prediction has given rise to\ngroundbreaking advancements in AI. While this generative AI approach has\nproduced impressive results, it heavily leans on human supervision. Even\nstate-of-the-art AI models like ChatGPT depend on fine-tuning through human\ndemonstrations, demanding extensive human input and domain expertise. This\nstrong reliance on human oversight poses a significant hurdle to the\nadvancement of AI innovation. To address this limitation, we propose a novel\nparadigm termed Exploratory AI (EAI) aimed at autonomously generating\nhigh-quality training data. Drawing inspiration from unsupervised reinforcement\nlearning (RL) pretraining, EAI achieves exploration within the natural language\nspace. We accomplish this by harnessing large language models to assess the\nnovelty of generated content. Our approach employs two key components: an actor\nthat generates novel content following exploration principles and a critic that\nevaluates the generated content, offering critiques to guide the actor.\nEmpirical evaluations demonstrate that EAI significantly boosts model\nperformance on complex reasoning tasks, addressing the limitations of\nhuman-intensive supervision.\n","authors":["Hao Liu","Matei Zaharia","Pieter Abbeel"],"pdf_url":"https://arxiv.org/pdf/2310.08899v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08889v1","updated":"2023-10-13T06:50:15Z","published":"2023-10-13T06:50:15Z","title":"PerturbScore: Connecting Discrete and Continuous Perturbations in NLP","summary":" With the rapid development of neural network applications in NLP, model\nrobustness problem is gaining more attention. Different from computer vision,\nthe discrete nature of texts makes it more challenging to explore robustness in\nNLP. Therefore, in this paper, we aim to connect discrete perturbations with\ncontinuous perturbations, therefore we can use such connections as a bridge to\nhelp understand discrete perturbations in NLP models. Specifically, we first\nexplore how to connect and measure the correlation between discrete\nperturbations and continuous perturbations. Then we design a regression task as\na PerturbScore to learn the correlation automatically. Through experimental\nresults, we find that we can build a connection between discrete and continuous\nperturbations and use the proposed PerturbScore to learn such correlation,\nsurpassing previous methods used in discrete perturbation measuring. 
Further,\nthe proposed PerturbScore can be well generalized to different datasets,\nperturbation methods, indicating that we can use it as a powerful tool to study\nmodel robustness in NLP.\n","authors":["Linyang Li","Ke Ren","Yunfan Shao","Pengyu Wang","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2310.08889v1.pdf","comment":"Accepted by Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2310.08885v1","updated":"2023-10-13T06:36:26Z","published":"2023-10-13T06:36:26Z","title":"InstructTODS: Large Language Models for End-to-End Task-Oriented\n Dialogue Systems","summary":" Large language models (LLMs) have been used for diverse tasks in natural\nlanguage processing (NLP), yet remain under-explored for task-oriented dialogue\nsystems (TODS), especially for end-to-end TODS. We present InstructTODS, a\nnovel off-the-shelf framework for zero-shot end-to-end task-oriented dialogue\nsystems that can adapt to diverse domains without fine-tuning. By leveraging\nLLMs, InstructTODS generates a proxy belief state that seamlessly translates\nuser intentions into dynamic queries for efficient interaction with any KB. Our\nextensive experiments demonstrate that InstructTODS achieves comparable\nperformance to fully fine-tuned TODS in guiding dialogues to successful\ncompletion without prior knowledge or task-specific data. Furthermore, a\nrigorous human evaluation of end-to-end TODS shows that InstructTODS produces\ndialogue responses that notably outperform both the gold responses and the\nstate-of-the-art TODS in terms of helpfulness, informativeness, and humanness.\nMoreover, the effectiveness of LLMs in TODS is further supported by our\ncomprehensive evaluations on TODS subtasks: dialogue state tracking, intent\nclassification, and response generation. Code and implementations could be\nfound here https://github.com/WillyHC22/InstructTODS/\n","authors":["Willy Chung","Samuel Cahyawijaya","Bryan Wilie","Holy Lovenia","Pascale Fung"],"pdf_url":"https://arxiv.org/pdf/2310.08885v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08877v1","updated":"2023-10-13T06:03:47Z","published":"2023-10-13T06:03:47Z","title":"Retrieval-Generation Alignment for End-to-End Task-Oriented Dialogue\n System","summary":" Developing an efficient retriever to retrieve knowledge from a large-scale\nknowledge base (KB) is critical for task-oriented dialogue systems to\neffectively handle localized and specialized tasks. However, widely used\ngenerative models such as T5 and ChatGPT often struggle to differentiate subtle\ndifferences among the retrieved KB records when generating responses, resulting\nin suboptimal quality of generated responses. In this paper, we propose the\napplication of maximal marginal likelihood to train a perceptive retriever by\nutilizing signals from response generation for supervision. In addition, our\napproach goes beyond considering solely retrieved entities and incorporates\nvarious meta knowledge to guide the generator, thus improving the utilization\nof knowledge. We evaluate our approach on three task-oriented dialogue datasets\nusing T5 and ChatGPT as the backbone models. The results demonstrate that when\ncombined with meta knowledge, the response generator can effectively leverage\nhigh-quality knowledge records from the retriever and enhance the quality of\ngenerated responses. 
The codes and models of this paper are available at\nhttps://github.com/shenwzh3/MK-TOD.\n","authors":["Weizhou Shen","Yingqi Gao","Canbin Huang","Fanqi Wan","Xiaojun Quan","Wei Bi"],"pdf_url":"https://arxiv.org/pdf/2310.08877v1.pdf","comment":"Accepted to EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.04443v2","updated":"2023-10-13T05:07:49Z","published":"2023-10-02T21:24:26Z","title":"Human Mobility Question Answering (Vision Paper)","summary":" Question answering (QA) systems have attracted much attention from the\nartificial intelligence community as they can learn to answer questions based\non the given knowledge source (e.g., images in visual question answering).\nHowever, the research into question answering systems with human mobility data\nremains unexplored. Mining human mobility data is crucial for various\napplications such as smart city planning, pandemic management, and personalised\nrecommendation system. In this paper, we aim to tackle this gap and introduce a\nnovel task, that is, human mobility question answering (MobQA). The aim of the\ntask is to let the intelligent system learn from mobility data and answer\nrelated questions. This task presents a new paradigm change in mobility\nprediction research and further facilitates the research of human mobility\nrecommendation systems. To better support this novel research topic, this\nvision paper also proposes an initial design of the dataset and a potential\ndeep learning model framework for the introduced MobQA task. We hope that this\npaper will provide novel insights and open new directions in human mobility\nresearch and question answering research.\n","authors":["Hao Xue","Flora D. Salim"],"pdf_url":"https://arxiv.org/pdf/2310.04443v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08860v1","updated":"2023-10-13T05:03:13Z","published":"2023-10-13T05:03:13Z","title":"Guiding AMR Parsing with Reverse Graph Linearization","summary":" Abstract Meaning Representation (AMR) parsing aims to extract an abstract\nsemantic graph from a given sentence. The sequence-to-sequence approaches,\nwhich linearize the semantic graph into a sequence of nodes and edges and\ngenerate the linearized graph directly, have achieved good performance.\nHowever, we observed that these approaches suffer from structure loss\naccumulation during the decoding process, leading to a much lower F1-score for\nnodes and edges decoded later compared to those decoded earlier. To address\nthis issue, we propose a novel Reverse Graph Linearization (RGL) enhanced\nframework. RGL defines both default and reverse linearization orders of an AMR\ngraph, where most structures at the back part of the default order appear at\nthe front part of the reversed order and vice versa. RGL incorporates the\nreversed linearization to the original AMR parser through a two-pass\nself-distillation mechanism, which guides the model when generating the default\nlinearizations. Our analysis shows that our proposed method significantly\nmitigates the problem of structure loss accumulation, outperforming the\npreviously best AMR parsing model by 0.8 and 0.5 Smatch scores on the AMR 2.0\nand AMR 3.0 dataset, respectively. 
The code are available at\nhttps://github.com/pkunlp-icler/AMR_reverse_graph_linearization.\n","authors":["Bofei Gao","Liang Chen","Peiyi Wang","Zhifang Sui","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2310.08860v1.pdf","comment":"Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2306.17439v2","updated":"2023-10-13T04:50:04Z","published":"2023-06-30T07:24:32Z","title":"Provable Robust Watermarking for AI-Generated Text","summary":" We study the problem of watermarking large language models (LLMs) generated\ntext -- one of the most promising approaches for addressing the safety\nchallenges of LLM usage. In this paper, we propose a rigorous theoretical\nframework to quantify the effectiveness and robustness of LLM watermarks. We\npropose a robust and high-quality watermark method, Unigram-Watermark, by\nextending an existing approach with a simplified fixed grouping strategy. We\nprove that our watermark method enjoys guaranteed generation quality,\ncorrectness in watermark detection, and is robust against text editing and\nparaphrasing. Experiments on three varying LLMs and two datasets verify that\nour Unigram-Watermark achieves superior detection accuracy and comparable\ngeneration quality in perplexity, thus promoting the responsible use of LLMs.\nCode is available at https://github.com/XuandongZhao/Unigram-Watermark.\n","authors":["Xuandong Zhao","Prabhanjan Ananth","Lei Li","Yu-Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2306.17439v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10687v2","updated":"2023-10-13T03:38:57Z","published":"2023-09-16T00:55:08Z","title":"EchoPrompt: Instructing the Model to Rephrase Queries for Improved\n In-context Learning","summary":" Language models are achieving impressive performance on various tasks by\naggressively adopting inference-time prompting techniques, such as zero-shot\nand few-shot prompting. In this work, we introduce EchoPrompt, a simple yet\neffective approach that prompts the model to rephrase its queries before\nanswering them. EchoPrompt is adapted for both zero-shot and few-shot\nin-context learning with standard and chain-of-thought prompting. Experimental\nresults show that EchoPrompt yields substantial improvements across all these\nsettings for four families of causal language models. These improvements are\nobserved across various numerical reasoning (e.g. GSM8K, SVAMP), reading\ncomprehension (e.g. DROP), and logical reasoning (e.g. Coin Flipping) tasks. On\naverage, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002\nby 5% in numerical tasks and 13% in reading comprehension tasks. We investigate\nthe factors contributing to EchoPrompt's effectiveness through ablation\nstudies, which reveal that both the original query and the model-generated\nrephrased version are instrumental in its performance gains. Our empirical\nresults indicate that EchoPrompt is an effective technique that enhances\nin-context learning performance. 
We recommend incorporating EchoPrompt into\nvarious baseline prompting strategies to achieve performance boosts.\n","authors":["Rajasekhar Reddy Mekala","Yasaman Razeghi","Sameer Singh"],"pdf_url":"https://arxiv.org/pdf/2309.10687v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08840v1","updated":"2023-10-13T03:38:38Z","published":"2023-10-13T03:38:38Z","title":"Large Language Models as Source Planner for Personalized\n Knowledge-grounded Dialogue","summary":" Open-domain dialogue system usually requires different sources of knowledge\nto generate more informative and evidential responses. However, existing\nknowledge-grounded dialogue systems either focus on a single knowledge source\nor overlook the dependency between multiple sources of knowledge, which may\nresult in generating inconsistent or even paradoxical responses. To incorporate\nmultiple knowledge sources and dependencies between them, we propose SAFARI, a\nnovel framework that leverages the exceptional capabilities of large language\nmodels (LLMs) in planning, understanding, and incorporating under both\nsupervised and unsupervised settings. Specifically, SAFARI decouples the\nknowledge grounding into multiple sources and response generation, which allows\neasy extension to various knowledge sources including the possibility of not\nusing any sources. To study the problem, we construct a personalized\nknowledge-grounded dialogue dataset \\textit{\\textbf{K}nowledge \\textbf{B}ehind\n\\textbf{P}ersona}~(\\textbf{KBP}), which is the first to consider the dependency\nbetween persona and implicit knowledge. Experimental results on the KBP dataset\ndemonstrate that the SAFARI framework can effectively produce\npersona-consistent and knowledge-enhanced responses.\n","authors":["Hongru Wang","Minda Hu","Yang Deng","Rui Wang","Fei Mi","Weichao Wang","Yasheng Wang","Wai-Chung Kwan","Irwin King","Kam-Fai Wong"],"pdf_url":"https://arxiv.org/pdf/2310.08840v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.12966v3","updated":"2023-10-13T02:41:28Z","published":"2023-08-24T17:59:17Z","title":"Qwen-VL: A Versatile Vision-Language Model for Understanding,\n Localization, Text Reading, and Beyond","summary":" In this work, we introduce the Qwen-VL series, a set of large-scale\nvision-language models (LVLMs) designed to perceive and understand both texts\nand images. Starting from the Qwen-LM as a foundation, we endow it with visual\ncapacity by the meticulously designed (i) visual receptor, (ii) input-output\ninterface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal\ncleaned corpus. Beyond the conventional image description and\nquestion-answering, we implement the grounding and text-reading ability of\nQwen-VLs by aligning image-caption-box tuples. The resulting models, including\nQwen-VL and Qwen-VL-Chat, set new records for generalist models under similar\nmodel scales on a broad range of visual-centric benchmarks (e.g., image\ncaptioning, question answering, visual grounding) and different settings (e.g.,\nzero-shot, few-shot). Moreover, on real-world dialog benchmarks, our\ninstruction-tuned Qwen-VL-Chat also demonstrates superiority compared to\nexisting vision-language chatbots. 
Code, demo and models are available at\nhttps://github.com/QwenLM/Qwen-VL.\n","authors":["Jinze Bai","Shuai Bai","Shusheng Yang","Shijie Wang","Sinan Tan","Peng Wang","Junyang Lin","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2308.12966v3.pdf","comment":"Code, demo and models are available at\n https://github.com/QwenLM/Qwen-VL"},{"id":"http://arxiv.org/abs/2310.07849v2","updated":"2023-10-13T01:31:59Z","published":"2023-10-11T19:51:13Z","title":"Synthetic Data Generation with Large Language Models for Text\n Classification: Potential and Limitations","summary":" The collection and curation of high-quality training data is crucial for\ndeveloping text classification models with superior performance, but it is\noften associated with significant costs and time investment. Researchers have\nrecently explored using large language models (LLMs) to generate synthetic\ndatasets as an alternative approach. However, the effectiveness of the\nLLM-generated synthetic data in supporting model training is inconsistent\nacross different classification tasks. To better understand factors that\nmoderate the effectiveness of the LLM-generated synthetic data, in this study,\nwe look into how the performance of models trained on these synthetic data may\nvary with the subjectivity of classification. Our results indicate that\nsubjectivity, at both the task level and instance level, is negatively\nassociated with the performance of the model trained on synthetic data. We\nconclude by discussing the implications of our work on the potential and\nlimitations of leveraging LLM for synthetic data generation.\n","authors":["Zhuoyan Li","Hangxiao Zhu","Zhuoran Lu","Ming Yin"],"pdf_url":"https://arxiv.org/pdf/2310.07849v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08475v2","updated":"2023-10-13T01:12:25Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights. Code and\ndataset are available in https://github.com/zjunlp/EasyEdit.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08797v1","updated":"2023-10-13T01:00:15Z","published":"2023-10-13T01:00:15Z","title":"A Comparative Analysis of Task-Agnostic Distillation Methods for\n Compressing Transformer Language Models","summary":" Large language models have become a vital component in modern NLP, achieving\nstate of the art performance in a variety of tasks. 
However, they are often\ninefficient for real-world deployment due to their expensive inference costs.\nKnowledge distillation is a promising technique to improve their efficiency\nwhile retaining most of their effectiveness. In this paper, we reproduce,\ncompare and analyze several representative methods for task-agnostic\n(general-purpose) distillation of Transformer language models. Our target of\nstudy includes Output Distribution (OD) transfer, Hidden State (HS) transfer\nwith various layer mapping strategies, and Multi-Head Attention (MHA) transfer\nbased on MiniLMv2. Through our extensive experiments, we study the\neffectiveness of each method for various student architectures in both\nmonolingual (English) and multilingual settings. Overall, we show that MHA\ntransfer based on MiniLMv2 is generally the best option for distillation and\nexplain the potential reasons behind its success. Moreover, we show that HS\ntransfer remains as a competitive baseline, especially under a sophisticated\nlayer mapping strategy, while OD transfer consistently lags behind other\napproaches. Findings from this study helped us deploy efficient yet effective\nstudent models for latency-critical applications.\n","authors":["Takuma Udagawa","Aashka Trivedi","Michele Merler","Bishwaranjan Bhattacharjee"],"pdf_url":"https://arxiv.org/pdf/2310.08797v1.pdf","comment":"Accepted to EMNLP 2023 Industry Track"},{"id":"http://arxiv.org/abs/2310.08796v1","updated":"2023-10-13T00:49:59Z","published":"2023-10-13T00:49:59Z","title":"End-to-end Story Plot Generator","summary":" Story plots, while short, carry most of the essential information of a full\nstory that may contain tens of thousands of words. We study the problem of\nautomatic generation of story plots, which includes story premise, character\ndescriptions, plot outlines, etc. To generate a single engaging plot, existing\nplot generators (e.g., DOC (Yang et al., 2022a)) require hundreds to thousands\nof calls to LLMs (e.g., OpenAI API) in the planning stage of the story plot,\nwhich is costly and takes at least several minutes. Moreover, the hard-wired\nnature of the method makes the pipeline non-differentiable, blocking fast\nspecialization and personalization of the plot generator. In this paper, we\npropose three models, $\\texttt{OpenPlot}$, $\\texttt{E2EPlot}$ and\n$\\texttt{RLPlot}$, to address these challenges. $\\texttt{OpenPlot}$ replaces\nexpensive OpenAI API calls with LLaMA2 (Touvron et al., 2023) calls via careful\nprompt designs, which leads to inexpensive generation of high-quality training\ndatasets of story plots. We then train an end-to-end story plot generator,\n$\\texttt{E2EPlot}$, by supervised fine-tuning (SFT) using approximately 13000\nstory plots generated by $\\texttt{OpenPlot}$. $\\texttt{E2EPlot}$ generates\nstory plots of comparable quality to $\\texttt{OpenPlot}$, and is > 10$\\times$\nfaster (1k tokens in only 30 seconds on average). 
Finally, we obtain\n$\\texttt{RLPlot}$ that is further fine-tuned with RLHF on several different\nreward models for different aspects of story quality, which yields 60.0$\\%$\nwinning rate against $\\texttt{E2EPlot}$ along the aspect of suspense and\nsurprise.\n","authors":["Hanlin Zhu","Andrew Cohen","Danqing Wang","Kevin Yang","Xiaomeng Yang","Jiantao Jiao","Yuandong Tian"],"pdf_url":"https://arxiv.org/pdf/2310.08796v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2310.08795v1","updated":"2023-10-13T00:49:09Z","published":"2023-10-13T00:49:09Z","title":"Mitigating Bias for Question Answering Models by Tracking Bias Influence","summary":" Models of various NLP tasks have been shown to exhibit stereotypes, and the\nbias in the question answering (QA) models is especially harmful as the output\nanswers might be directly consumed by the end users. There have been datasets\nto evaluate bias in QA models, while bias mitigation technique for the QA\nmodels is still under-explored. In this work, we propose BMBI, an approach to\nmitigate the bias of multiple-choice QA models. Based on the intuition that a\nmodel would lean to be more biased if it learns from a biased example, we\nmeasure the bias level of a query instance by observing its influence on\nanother instance. If the influenced instance is more biased, we derive that the\nquery instance is biased. We then use the bias level detected as an\noptimization objective to form a multi-task learning setting in addition to the\noriginal QA task. We further introduce a new bias evaluation metric to quantify\nbias in a comprehensive and sensitive way. We show that our method could be\napplied to multiple QA formulations across multiple bias categories. It can\nsignificantly reduce the bias level in all 9 bias categories in the BBQ dataset\nwhile maintaining comparable QA accuracy.\n","authors":["Mingyu Derek Ma","Jiun-Yu Kao","Arpit Gupta","Yu-Hsiang Lin","Wenbo Zhao","Tagyoung Chung","Wei Wang","Kai-Wei Chang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2310.08795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08072v2","updated":"2023-10-13T00:40:29Z","published":"2023-10-12T06:46:07Z","title":"Training Generative Question-Answering on Synthetic Data Obtained from\n an Instruct-tuned Model","summary":" This paper presents a simple and cost-effective method for synthesizing data\nto train question-answering systems. For training, fine-tuning GPT models is a\ncommon practice in resource-rich languages like English, however, it becomes\nchallenging for non-English languages due to the scarcity of sufficient\nquestion-answer (QA) pairs. Existing approaches use question and answer\ngenerators trained on human-authored QA pairs, which involves substantial human\nexpenses. In contrast, we use an instruct-tuned model to generate QA pairs in a\nzero-shot or few-shot manner. We conduct experiments to compare various\nstrategies for obtaining QA pairs from the instruct-tuned model. 
The results\ndemonstrate that a model trained on our proposed synthetic data achieves\ncomparable performance to a model trained on manually curated datasets, without\nincurring human costs.\n","authors":["Kosuke Takahashi","Takahiro Omi","Kosuke Arima","Tatsuya Ishigaki"],"pdf_url":"https://arxiv.org/pdf/2310.08072v2.pdf","comment":"PACLIC 2023 short paper, 4 pages (6 pages including references), 4\n figures"},{"id":"http://arxiv.org/abs/2310.08780v1","updated":"2023-10-13T00:03:37Z","published":"2023-10-13T00:03:37Z","title":"\"Im not Racist but...\": Discovering Bias in the Internal Knowledge of\n Large Language Models","summary":" Large language models (LLMs) have garnered significant attention for their\nremarkable performance in a continuously expanding set of natural language\nprocessing tasks. However, these models have been shown to harbor inherent\nsocietal biases, or stereotypes, which can adversely affect their performance\nin their many downstream applications. In this paper, we introduce a novel,\npurely prompt-based approach to uncover hidden stereotypes within any arbitrary\nLLM. Our approach dynamically generates a knowledge representation of internal\nstereotypes, enabling the identification of biases encoded within the LLM's\ninternal knowledge. By illuminating the biases present in LLMs and offering a\nsystematic methodology for their analysis, our work contributes to advancing\ntransparency and promoting fairness in natural language processing systems.\n","authors":["Abel Salinas","Louis Penafiel","Robert McCormack","Fred Morstatter"],"pdf_url":"https://arxiv.org/pdf/2310.08780v1.pdf","comment":"Warning: This paper discusses and contains content that is offensive\n or upsetting"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.09291v1","updated":"2023-10-13T17:59:38Z","published":"2023-10-13T17:59:38Z","title":"Vision-by-Language for Training-Free Compositional Image Retrieval","summary":" Given an image and a target modification (e.g an image of the Eiffel tower\nand the text \"without people and at night-time\"), Compositional Image Retrieval\n(CIR) aims to retrieve the relevant target image in a database. While\nsupervised approaches rely on annotating triplets that is costly (i.e. query\nimage, textual modification, and target image), recent research sidesteps this\nneed by using large-scale vision-language models (VLMs), performing Zero-Shot\nCIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require\ntraining task-specific, customized models over large amounts of image-text\npairs. In this work, we propose to tackle CIR in a training-free manner via our\nCompositional Image Retrieval through Vision-by-Language (CIReVL), a simple,\nyet human-understandable and scalable pipeline that effectively recombines\nlarge-scale VLMs with large language models (LLMs). By captioning the reference\nimage using a pre-trained generative VLM and asking a LLM to recompose the\ncaption based on the textual target modification for subsequent retrieval via\ne.g. CLIP, we achieve modular language reasoning. In four ZS-CIR benchmarks, we\nfind competitive, in-part state-of-the-art performance - improving over\nsupervised methods. Moreover, the modularity of CIReVL offers simple\nscalability without re-training, allowing us to both investigate scaling laws\nand bottlenecks for ZS-CIR while easily scaling up to in parts more than double\nof previously reported results. 
Finally, we show that CIReVL makes CIR\nhuman-understandable by composing image and text in a modular fashion in the\nlanguage domain, thereby making it intervenable, allowing to post-hoc re-align\nfailure cases. Code will be released upon acceptance.\n","authors":["Shyamgopal Karthik","Karsten Roth","Massimiliano Mancini","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2310.09291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09289v1","updated":"2023-10-13T17:59:02Z","published":"2023-10-13T17:59:02Z","title":"An Unbiased Look at Datasets for Visuo-Motor Pre-Training","summary":" Visual representation learning hold great promise for robotics, but is\nseverely hampered by the scarcity and homogeneity of robotics datasets. Recent\nworks address this problem by pre-training visual representations on\nlarge-scale but out-of-domain data (e.g., videos of egocentric interactions)\nand then transferring them to target robotics tasks. While the field is heavily\nfocused on developing better pre-training algorithms, we find that dataset\nchoice is just as important to this paradigm's success. After all, the\nrepresentation can only learn the structures or priors present in the\npre-training dataset. To this end, we flip the focus on algorithms, and instead\nconduct a dataset centric analysis of robotic pre-training. Our findings call\ninto question some common wisdom in the field. We observe that traditional\nvision datasets (like ImageNet, Kinetics and 100 Days of Hands) are\nsurprisingly competitive options for visuo-motor representation learning, and\nthat the pre-training dataset's image distribution matters more than its size.\nFinally, we show that common simulation benchmarks are not a reliable proxy for\nreal world performance and that simple regularization strategies can\ndramatically improve real world policy learning.\nhttps://data4robotics.github.io\n","authors":["Sudeep Dasari","Mohan Kumar Srirama","Unnat Jain","Abhinav Gupta"],"pdf_url":"https://arxiv.org/pdf/2310.09289v1.pdf","comment":"Accepted to CoRL 2023"},{"id":"http://arxiv.org/abs/2310.09285v1","updated":"2023-10-13T17:52:16Z","published":"2023-10-13T17:52:16Z","title":"SAIR: Learning Semantic-aware Implicit Representation","summary":" Implicit representation of an image can map arbitrary coordinates in the\ncontinuous domain to their corresponding color values, presenting a powerful\ncapability for image reconstruction. Nevertheless, existing implicit\nrepresentation approaches only focus on building continuous appearance mapping,\nignoring the continuities of the semantic information across pixels. As a\nresult, they can hardly achieve desired reconstruction results when the\nsemantic information within input images is corrupted, for example, a large\nregion misses. To address the issue, we propose to learn semantic-aware\nimplicit representation (SAIR), that is, we make the implicit representation of\neach pixel rely on both its appearance and semantic information (\\eg, which\nobject does the pixel belong to). To this end, we propose a framework with two\nmodules: (1) building a semantic implicit representation (SIR) for a corrupted\nimage whose large regions miss. Given an arbitrary coordinate in the continuous\ndomain, we can obtain its respective text-aligned embedding indicating the\nobject the pixel belongs. (2) building an appearance implicit representation\n(AIR) based on the SIR. 
Given an arbitrary coordinate in the continuous domain,\nwe can reconstruct its color whether or not the pixel is missed in the input.\nWe validate the novel semantic-aware implicit representation method on the\nimage inpainting task, and the extensive experiments demonstrate that our\nmethod surpasses state-of-the-art approaches by a significant margin.\n","authors":["Canyu Zhang","Xiaoguang Li","Qing Guo","Song Wang"],"pdf_url":"https://arxiv.org/pdf/2310.09285v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15523v3","updated":"2023-10-13T17:47:41Z","published":"2023-05-24T19:20:59Z","title":"Task-aware Distributed Source Coding under Dynamic Bandwidth","summary":" Efficient compression of correlated data is essential to minimize\ncommunication overload in multi-sensor networks. In such networks, each sensor\nindependently compresses the data and transmits them to a central node due to\nlimited communication bandwidth. A decoder at the central node decompresses and\npasses the data to a pre-trained machine learning-based task to generate the\nfinal output. Thus, it is important to compress the features that are relevant\nto the task. Additionally, the final performance depends heavily on the total\navailable bandwidth. In practice, it is common to encounter varying\navailability in bandwidth, and higher bandwidth results in better performance\nof the task. We design a novel distributed compression framework composed of\nindependent encoders and a joint decoder, which we call neural distributed\nprincipal component analysis (NDPCA). NDPCA flexibly compresses data from\nmultiple sources to any available bandwidth with a single model, reducing\ncomputing and storage overhead. NDPCA achieves this by learning low-rank task\nrepresentations and efficiently distributing bandwidth among sensors, thus\nproviding a graceful trade-off between performance and bandwidth. Experiments\nshow that NDPCA improves the success rate of multi-view robotic arm\nmanipulation by 9% and the accuracy of object detection tasks on satellite\nimagery by 14% compared to an autoencoder with uniform bandwidth allocation.\n","authors":["Po-han Li","Sravan Kumar Ankireddy","Ruihan Zhao","Hossein Nourkhiz Mahjoub","Ehsan Moradi-Pari","Ufuk Topcu","Sandeep Chinchali","Hyeji Kim"],"pdf_url":"https://arxiv.org/pdf/2305.15523v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09276v1","updated":"2023-10-13T17:38:45Z","published":"2023-10-13T17:38:45Z","title":"Transformer-based Multimodal Change Detection with Multitask Consistency\n Constraints","summary":" Change detection plays a fundamental role in Earth observation for analyzing\ntemporal iterations over time. However, recent studies have largely neglected\nthe utilization of multimodal data that presents significant practical and\ntechnical advantages compared to single-modal approaches. This research focuses\non leveraging digital surface model (DSM) data and aerial images captured at\ndifferent times for detecting change beyond 2D. We observe that the current\nchange detection methods struggle with the multitask conflicts between semantic\nand height change detection tasks. To address this challenge, we propose an\nefficient Transformer-based network that learns shared representation between\ncross-dimensional inputs through cross-attention. 
It adopts a consistency\nconstraint to establish the multimodal relationship, which involves obtaining\npseudo change through height change thresholding and minimizing the difference\nbetween semantic and pseudo change within their overlapping regions. A\nDSM-to-image multimodal dataset encompassing three cities in the Netherlands\nwas constructed. It lays a new foundation for beyond-2D change detection from\ncross-dimensional inputs. Compared to five state-of-the-art change detection\nmethods, our model demonstrates consistent multitask superiority in terms of\nsemantic and height change detection. Furthermore, the consistency strategy can\nbe seamlessly adapted to the other methods, yielding promising improvements.\n","authors":["Biyuan Liu","Huaixin Chen","Kun Li","Michael Ying Yang"],"pdf_url":"https://arxiv.org/pdf/2310.09276v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09275v1","updated":"2023-10-13T17:38:41Z","published":"2023-10-13T17:38:41Z","title":"Understanding and Modeling the Effects of Task and Context on Drivers'\n Gaze Allocation","summary":" Understanding what drivers look at is important for many applications,\nincluding driver training, monitoring, and assistance, as well as self-driving.\nTraditionally, factors affecting human visual attention have been divided into\nbottom-up (involuntary attraction to salient regions) and top-down (task- and\ncontext-driven). Although both play a role in drivers' gaze allocation, most of\nthe existing modeling approaches apply techniques developed for bottom-up\nsaliency and do not consider task and context influences explicitly. Likewise,\ncommon driving attention benchmarks lack relevant task and context annotations.\nTherefore, to enable analysis and modeling of these factors for drivers' gaze\nprediction, we propose the following: 1) address some shortcomings of the\npopular DR(eye)VE dataset and extend it with per-frame annotations for driving\ntask and context; 2) benchmark a number of baseline and SOTA models for\nsaliency and driver gaze prediction and analyze them w.r.t. the new\nannotations; and finally, 3) a novel model that modulates drivers' gaze\nprediction with explicit action and context information, and as a result\nsignificantly improves SOTA performance on DR(eye)VE overall (by 24\\% KLD and\n89\\% NSS) and on a subset of action and safety-critical intersection scenarios\n(by 10--30\\% KLD). Extended annotations, code for model and evaluation will be\nmade publicly available.\n","authors":["Iuliia Kotseruba","John K. Tsotsos"],"pdf_url":"https://arxiv.org/pdf/2310.09275v1.pdf","comment":"12 pages, 8 figures, 8 tables"},{"id":"http://arxiv.org/abs/2310.09247v1","updated":"2023-10-13T16:53:25Z","published":"2023-10-13T16:53:25Z","title":"Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet\n Hierarchy","summary":" Text-to-image synthesis has recently attracted widespread attention due to\nrapidly improving quality and numerous practical applications. However, the\nlanguage understanding capabilities of text-to-image models are still poorly\nunderstood, which makes it difficult to reason about prompt formulations that a\ngiven model would understand well. In this work, we measure the capability of\npopular text-to-image models to understand $\\textit{hypernymy}$, or the \"is-a\"\nrelation between words. We design two automatic metrics based on the WordNet\nsemantic hierarchy and existing image classifiers pretrained on ImageNet. 
These\nmetrics both enable broad quantitative comparison of linguistic capabilities\nfor text-to-image models and offer a way of finding fine-grained qualitative\ndifferences, such as words that are unknown to models and thus are difficult\nfor them to draw. We comprehensively evaluate popular text-to-image models,\nincluding GLIDE, Latent Diffusion, and Stable Diffusion, showing how our\nmetrics can provide a better understanding of the individual strengths and\nweaknesses of these models.\n","authors":["Anton Baryshnikov","Max Ryabinin"],"pdf_url":"https://arxiv.org/pdf/2310.09247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09236v1","updated":"2023-10-13T16:40:29Z","published":"2023-10-13T16:40:29Z","title":"Time CNN and Graph Convolution Network for Epileptic Spike Detection in\n MEG Data","summary":" Magnetoencephalography (MEG) recordings of patients with epilepsy exhibit\nspikes, a typical biomarker of the pathology. Detecting those spikes allows\naccurate localization of brain regions triggering seizures. Spike detection is\noften performed manually. However, it is a burdensome and error prone task due\nto the complexity of MEG data. To address this problem, we propose a 1D\ntemporal convolutional neural network (Time CNN) coupled with a graph\nconvolutional network (GCN) to classify short time frames of MEG recording as\ncontaining a spike or not. Compared to other recent approaches, our models have\nfewer parameters to train and we propose to use a GCN to account for MEG\nsensors spatial relationships. Our models produce clinically relevant results\nand outperform deep learning-based state-of-the-art methods reaching a\nclassification f1-score of 76.7% on a balanced dataset and of 25.5% on a\nrealistic, highly imbalanced dataset, for the spike class.\n","authors":["Pauline Mouches","Thibaut Dejean","Julien Jung","Romain Bouet","Carole Lartizien","Romain Quentin"],"pdf_url":"https://arxiv.org/pdf/2310.09236v1.pdf","comment":"This work has been submitted to IEEE ISBI 2024 for possible\n publication"},{"id":"http://arxiv.org/abs/2310.09221v1","updated":"2023-10-13T16:18:48Z","published":"2023-10-13T16:18:48Z","title":"Ultrasound Image Segmentation of Thyroid Nodule via Latent Semantic\n Feature Co-Registration","summary":" Segmentation of nodules in thyroid ultrasound imaging plays a crucial role in\nthe detection and treatment of thyroid cancer. However, owing to the diversity\nof scanner vendors and imaging protocols in different hospitals, the automatic\nsegmentation model, which has already demonstrated expert-level accuracy in the\nfield of medical image segmentation, finds its accuracy reduced as the result\nof its weak generalization performance when being applied in clinically\nrealistic environments. To address this issue, the present paper proposes ASTN,\na framework for thyroid nodule segmentation achieved through a new type\nco-registration network. By extracting latent semantic information from the\natlas and target images and utilizing in-depth features to accomplish the\nco-registration of nodules in thyroid ultrasound images, this framework can\nensure the integrity of anatomical structure and reduce the impact on\nsegmentation as the result of overall differences in image caused by different\ndevices. In addition, this paper also provides an atlas selection algorithm to\nmitigate the difficulty of co-registration. 
As shown by the evaluation results\ncollected from the datasets of different devices, thanks to the method we\nproposed, the model generalization has been greatly improved while maintaining\na high level of segmentation accuracy.\n","authors":["Xuewei Li","Yaqiao Zhu","Jie Gao","Xi Wei","Ruixuan Zhang","Yuan Tian","Mei Yu"],"pdf_url":"https://arxiv.org/pdf/2310.09221v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03026v2","updated":"2023-10-13T16:13:43Z","published":"2023-10-04T17:59:49Z","title":"LanguageMPC: Large Language Models as Decision Makers for Autonomous\n Driving","summary":" Existing learning-based autonomous driving (AD) systems face challenges in\ncomprehending high-level information, generalizing to rare events, and\nproviding interpretability. To address these problems, this work employs Large\nLanguage Models (LLMs) as a decision-making component for complex AD scenarios\nthat require human commonsense understanding. We devise cognitive pathways to\nenable comprehensive reasoning with LLMs, and develop algorithms for\ntranslating LLM decisions into actionable driving commands. Through this\napproach, LLM decisions are seamlessly integrated with low-level controllers by\nguided parameter matrix adaptation. Extensive experiments demonstrate that our\nproposed method not only consistently surpasses baseline approaches in\nsingle-vehicle tasks, but also helps handle complex driving behaviors even\nmulti-vehicle coordination, thanks to the commonsense reasoning capabilities of\nLLMs. This paper presents an initial step toward leveraging LLMs as effective\ndecision-makers for intricate AD scenarios in terms of safety, efficiency,\ngeneralizability, and interoperability. We aspire for it to serve as\ninspiration for future research in this field. Project page:\nhttps://sites.google.com/view/llm-mpc\n","authors":["Hao Sha","Yao Mu","Yuxuan Jiang","Li Chen","Chenfeng Xu","Ping Luo","Shengbo Eben Li","Masayoshi Tomizuka","Wei Zhan","Mingyu Ding"],"pdf_url":"https://arxiv.org/pdf/2310.03026v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09109v2","updated":"2023-10-13T16:12:32Z","published":"2023-06-15T13:11:30Z","title":"NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and\n Pose Annotations","summary":" Recent advances in neural reconstruction enable high-quality 3D object\nreconstruction from casually captured image collections. Current techniques\nmostly analyze their progress on relatively simple image collections where\nStructure-from-Motion (SfM) techniques can provide ground-truth (GT) camera\nposes. We note that SfM techniques tend to fail on in-the-wild image\ncollections such as image search results with varying backgrounds and\nilluminations. To enable systematic research progress on 3D reconstruction from\ncasual image captures, we propose NAVI: a new dataset of category-agnostic\nimage collections of objects with high-quality 3D scans along with per-image\n2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D\nalignments allow us to extract accurate derivative annotations such as dense\npixel correspondences, depth and segmentation maps. We demonstrate the use of\nNAVI image collections on different problem settings and show that NAVI enables\nmore thorough evaluations that were not possible with existing datasets. We\nbelieve NAVI is beneficial for systematic research progress on 3D\nreconstruction and correspondence estimation. 
Project page:\nhttps://navidataset.github.io\n","authors":["Varun Jampani","Kevis-Kokitsi Maninis","Andreas Engelhardt","Arjun Karpur","Karen Truong","Kyle Sargent","Stefan Popov","André Araujo","Ricardo Martin-Brualla","Kaushal Patel","Daniel Vlasic","Vittorio Ferrari","Ameesh Makadia","Ce Liu","Yuanzhen Li","Howard Zhou"],"pdf_url":"https://arxiv.org/pdf/2306.09109v2.pdf","comment":"NeurIPS 2023 camera ready. Project page:\n https://navidataset.github.io"},{"id":"http://arxiv.org/abs/2310.09213v1","updated":"2023-10-13T16:07:31Z","published":"2023-10-13T16:07:31Z","title":"Unseen Image Synthesis with Diffusion Models","summary":" While the current trend in the generative field is scaling up towards larger\nmodels and more training data for generalized domain representations, we go the\nopposite direction in this work by synthesizing unseen domain images without\nadditional training. We do so via latent sampling and geometric optimization\nusing pre-trained and frozen Denoising Diffusion Probabilistic Models (DDPMs)\non single-domain datasets. Our key observation is that DDPMs pre-trained even\njust on single-domain images are already equipped with sufficient\nrepresentation abilities to reconstruct arbitrary images from the inverted\nlatent encoding following bi-directional deterministic diffusion and denoising\ntrajectories. This motivates us to investigate the statistical and geometric\nbehaviors of the Out-Of-Distribution (OOD) samples from unseen image domains in\nthe latent spaces along the denoising chain. Notably, we theoretically and\nempirically show that the inverted OOD samples also establish Gaussians that\nare distinguishable from the original In-Domain (ID) samples in the\nintermediate latent spaces, which allows us to sample from them directly.\nGeometrical domain-specific and model-dependent information of the unseen\nsubspace (e.g., sample-wise distance and angles) is used to further optimize\nthe sampled OOD latent encodings from the estimated Gaussian prior. We conduct\nextensive analysis and experiments using pre-trained diffusion models (DDPM,\niDDPM) on different datasets (AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom),\nproving the effectiveness of this novel perspective to explore and re-think the\ndiffusion models' data synthesis generalization ability.\n","authors":["Ye Zhu","Yu Wu","Zhiwei Deng","Olga Russakovsky","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2310.09213v1.pdf","comment":"28 pages including appendices"},{"id":"http://arxiv.org/abs/2303.10959v2","updated":"2023-10-13T15:56:51Z","published":"2023-03-20T09:33:05Z","title":"Constructing Metric-Semantic Maps using Floor Plan Priors for Long-Term\n Indoor Localization","summary":" Object-based maps are relevant for scene understanding since they integrate\ngeometric and semantic information of the environment, allowing autonomous\nrobots to robustly localize and interact with on objects. In this paper, we\naddress the task of constructing a metric-semantic map for the purpose of\nlong-term object-based localization. We exploit 3D object detections from\nmonocular RGB frames for both, the object-based map construction, and for\nglobally localizing in the constructed map. To tailor the approach to a target\nenvironment, we propose an efficient way of generating 3D annotations to\nfinetune the 3D object detection model. We evaluate our map construction in an\noffice building, and test our long-term localization approach on challenging\nsequences recorded in the same environment over nine months. 
The experiments\nsuggest that our approach is suitable for constructing metric-semantic maps,\nand that our localization approach is robust to long-term changes. Both, the\nmapping algorithm and the localization pipeline can run online on an onboard\ncomputer. We release an open-source C++/ROS implementation of our approach.\n","authors":["Nicky Zimmerman","Matteo Sodano","Elias Marks","Jens Behley","Cyrill Stachniss"],"pdf_url":"https://arxiv.org/pdf/2303.10959v2.pdf","comment":"7 pages, accepted to IROS 2023"},{"id":"http://arxiv.org/abs/2310.09199v1","updated":"2023-10-13T15:45:19Z","published":"2023-10-13T15:45:19Z","title":"PaLI-3 Vision Language Models: Smaller, Faster, Stronger","summary":" This paper presents PaLI-3, a smaller, faster, and stronger vision language\nmodel (VLM) that compares favorably to similar models that are 10x larger. As\npart of arriving at this strong performance, we compare Vision Transformer\n(ViT) models pretrained using classification objectives to contrastively\n(SigLIP) pretrained ones. We find that, while slightly underperforming on\nstandard image classification benchmarks, SigLIP-based PaLI shows superior\nperformance across various multimodal benchmarks, especially on localization\nand visually-situated text understanding. We scale the SigLIP image encoder up\nto 2 billion parameters, and achieves a new state-of-the-art on multilingual\ncross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles\nresearch on fundamental pieces of complex VLMs, and could fuel a new generation\nof scaled-up models.\n","authors":["Xi Chen","Xiao Wang","Lucas Beyer","Alexander Kolesnikov","Jialin Wu","Paul Voigtlaender","Basil Mustafa","Sebastian Goodman","Ibrahim Alabdulmohsin","Piotr Padlewski","Daniel Salz","Xi Xiong","Daniel Vlasic","Filip Pavetic","Keran Rong","Tianli Yu","Daniel Keysers","Xiaohua Zhai","Radu Soricut"],"pdf_url":"https://arxiv.org/pdf/2310.09199v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16670v2","updated":"2023-10-13T15:45:13Z","published":"2023-09-28T17:59:51Z","title":"Decaf: Monocular Deformation Capture for Face and Hand Interactions","summary":" Existing methods for 3D tracking from monocular RGB videos predominantly\nconsider articulated and rigid objects. Modelling dense non-rigid object\ndeformations in this setting remained largely unaddressed so far, although such\neffects can improve the realism of the downstream applications such as AR/VR\nand avatar communications. This is due to the severe ill-posedness of the\nmonocular view setting and the associated challenges. While it is possible to\nnaively track multiple non-rigid objects independently using 3D templates or\nparametric 3D models, such an approach would suffer from multiple artefacts in\nthe resulting 3D estimates such as depth ambiguity, unnatural intra-object\ncollisions and missing or implausible deformations. Hence, this paper\nintroduces the first method that addresses the fundamental challenges depicted\nabove and that allows tracking human hands interacting with human faces in 3D\nfrom single monocular RGB videos. We model hands as articulated objects\ninducing non-rigid face deformations during an active interaction. Our method\nrelies on a new hand-face motion and interaction capture dataset with realistic\nface deformations acquired with a markerless multi-view camera system. 
As a\npivotal step in its creation, we process the reconstructed raw 3D shapes with\nposition-based dynamics and an approach for non-uniform stiffness estimation of\nthe head tissues, which results in plausible annotations of the surface\ndeformations, hand-face contact regions and head-hand positions. At the core of\nour neural approach are a variational auto-encoder supplying the hand-face\ndepth prior and modules that guide the 3D tracking by estimating the contacts\nand the deformations. Our final 3D hand and face reconstructions are realistic\nand more plausible compared to several baselines applicable in our setting,\nboth quantitatively and qualitatively.\nhttps://vcai.mpi-inf.mpg.de/projects/Decaf\n","authors":["Soshi Shimada","Vladislav Golyanik","Patrick Pérez","Christian Theobalt"],"pdf_url":"https://arxiv.org/pdf/2309.16670v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17216v3","updated":"2023-10-13T15:35:42Z","published":"2023-05-26T19:22:03Z","title":"Generating Images with Multimodal Language Models","summary":" We propose a method to fuse frozen text-only large language models (LLMs)\nwith pre-trained image encoder and decoder models, by mapping between their\nembedding spaces. Our model demonstrates a wide suite of multimodal\ncapabilities: image retrieval, novel image generation, and multimodal dialogue.\nOurs is the first approach capable of conditioning on arbitrarily interleaved\nimage and text inputs to generate coherent image (and text) outputs. To achieve\nstrong performance on image generation, we propose an efficient mapping network\nto ground the LLM to an off-the-shelf text-to-image generation model. This\nmapping network translates hidden representations of text into the embedding\nspace of the visual models, enabling us to leverage the strong text\nrepresentations of the LLM for visual outputs. Our approach outperforms\nbaseline generation models on tasks with longer and more complex language. In\naddition to novel image generation, our model is also capable of image\nretrieval from a prespecified dataset, and decides whether to retrieve or\ngenerate at inference time. This is done with a learnt decision module which\nconditions on the hidden representations of the LLM. Our model exhibits a wider\nrange of capabilities compared to prior multimodal language models. It can\nprocess image-and-text inputs, and produce retrieved images, generated images,\nand generated text -- outperforming non-LLM based generation models across\nseveral text-to-image tasks that measure context dependence.\n","authors":["Jing Yu Koh","Daniel Fried","Ruslan Salakhutdinov"],"pdf_url":"https://arxiv.org/pdf/2305.17216v3.pdf","comment":"NeurIPS 2023. Project page: http://jykoh.com/gill"},{"id":"http://arxiv.org/abs/2308.02490v2","updated":"2023-10-13T15:16:59Z","published":"2023-08-04T17:59:47Z","title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","summary":" We propose MM-Vet, an evaluation benchmark that examines large multimodal\nmodels (LMMs) on complicated multimodal tasks. Recent LMMs have shown various\nintriguing abilities, such as solving math problems written on the blackboard,\nreasoning about events and celebrities in news images, and explaining visual\njokes. Rapid model advancements pose challenges to evaluation benchmark\ndevelopment. 
Problems include: (1) How to systematically structure and evaluate\nthe complicated multimodal tasks; (2) How to design evaluation metrics that\nwork well across question and answer types; and (3) How to give model insights\nbeyond a simple performance ranking. To this end, we present MM-Vet, designed\nbased on the insight that the intriguing ability to solve complicated tasks is\noften achieved by a generalist model being able to integrate different core\nvision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and\nexamines the 16 integrations of interest derived from the capability\ncombination. For evaluation metrics, we propose an LLM-based evaluator for\nopen-ended outputs. The evaluator enables the evaluation across different\nquestion types and answer styles, resulting in a unified scoring metric. We\nevaluate representative LMMs on MM-Vet, providing insights into the\ncapabilities of different LMM system paradigms and models. Code and data are\navailable at https://github.com/yuweihao/MM-Vet.\n","authors":["Weihao Yu","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Kevin Lin","Zicheng Liu","Xinchao Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.02490v2.pdf","comment":"Update results of OpenFlamingo-9B (MPT), LLaMA-Adapter v2-7B, and\n Otter-9B (MPT). Code, data and leaderboard:\n https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2310.09170v1","updated":"2023-10-13T15:03:21Z","published":"2023-10-13T15:03:21Z","title":"mnmDTW: An extension to Dynamic Time Warping for Camera-based Movement\n Error Localization","summary":" In this proof of concept, we use Computer Vision (CV) methods to extract pose\ninformation out of exercise videos. We then employ a modified version of\nDynamic Time Warping (DTW) to calculate the deviation from a gold standard\nexecution of the exercise. Specifically, we calculate the distance between each\nbody part individually to get a more precise measure for exercise accuracy. We\ncan show that exercise mistakes are clearly visible, identifiable and\nlocalizable through this metric.\n","authors":["Sebastian Dill","Maurice Rohr"],"pdf_url":"https://arxiv.org/pdf/2310.09170v1.pdf","comment":"Poster Prague 2023 Conference, 4 pages"},{"id":"http://arxiv.org/abs/2310.05768v2","updated":"2023-10-13T15:00:11Z","published":"2023-10-09T14:54:37Z","title":"DANet: Enhancing Small Object Detection through an Efficient Deformable\n Attention Network","summary":" Efficient and accurate detection of small objects in manufacturing settings,\nsuch as defects and cracks, is crucial for ensuring product quality and safety.\nTo address this issue, we proposed a comprehensive strategy by synergizing\nFaster R-CNN with cutting-edge methods. By combining Faster R-CNN with Feature\nPyramid Network, we enable the model to efficiently handle multi-scale features\nintrinsic to manufacturing environments. Additionally, Deformable Net is used\nthat contorts and conforms to the geometric variations of defects, bringing\nprecision in detecting even the minuscule and complex features. Then, we\nincorporated an attention mechanism called Convolutional Block Attention Module\nin each block of our base ResNet50 network to selectively emphasize informative\nfeatures and suppress less useful ones. After that we incorporated RoI Align,\nreplacing RoI Pooling for finer region-of-interest alignment and finally the\nintegration of Focal Loss effectively handles class imbalance, crucial for rare\ndefect occurrences. 
The rigorous evaluation of our model on both the NEU-DET\nand Pascal VOC datasets underscores its robust performance and generalization\ncapabilities. On the NEU-DET dataset, our model exhibited a profound\nunderstanding of steel defects, achieving state-of-the-art accuracy in\nidentifying various defects. Simultaneously, when evaluated on the Pascal VOC\ndataset, our model showcases its ability to detect objects across a wide\nspectrum of categories within complex and small scenes.\n","authors":["Md Sohag Mia","Abdullah Al Bary Voban","Abu Bakor Hayat Arnob","Abdu Naim","Md Kawsar Ahmed","Md Shariful Islam"],"pdf_url":"https://arxiv.org/pdf/2310.05768v2.pdf","comment":"ICCD-23"},{"id":"http://arxiv.org/abs/2306.16585v2","updated":"2023-10-13T14:56:58Z","published":"2023-06-28T22:36:44Z","title":"SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and\n Quasi-Planar Segmentation","summary":" The availability of real-time semantics greatly improves the core geometric\nfunctionality of SLAM systems, enabling numerous robotic and AR/VR\napplications. We present a new methodology for real-time semantic mapping from\nRGB-D sequences that combines a 2D neural network and a 3D network based on a\nSLAM system with 3D occupancy mapping. When segmenting a new frame we perform\nlatent feature re-projection from previous frames based on differentiable\nrendering. Fusing re-projected feature maps from previous frames with\ncurrent-frame features greatly improves image segmentation quality, compared to\na baseline that processes images independently. For 3D map processing, we\npropose a novel geometric quasi-planar over-segmentation method that groups 3D\nmap elements likely to belong to the same semantic classes, relying on surface\nnormals. We also describe a novel neural network design for lightweight\nsemantic map post-processing. Our system achieves state-of-the-art semantic\nmapping quality within 2D-3D networks-based systems and matches the performance\nof 3D convolutional networks on three real indoor datasets, while working in\nreal-time. Moreover, it shows better cross-sensor generalization abilities\ncompared to 3D CNNs, enabling training and inference with different depth\nsensors. Code and data will be released on project page:\nhttp://jingwenwang95.github.io/SeMLaPS\n","authors":["Jingwen Wang","Juan Tarrio","Lourdes Agapito","Pablo F. Alcantarilla","Alexander Vakhitov"],"pdf_url":"https://arxiv.org/pdf/2306.16585v2.pdf","comment":"RA-L 2023. 8 pages, 7 figures. Project page:\n http://jingwenwang95.github.io/SeMLaPS"},{"id":"http://arxiv.org/abs/2310.09147v1","updated":"2023-10-13T14:39:34Z","published":"2023-10-13T14:39:34Z","title":"Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA","summary":" Text-based visual question answering (TextVQA) faces the significant\nchallenge of avoiding redundant relational inference. To be specific, a large\nnumber of detected objects and optical character recognition (OCR) tokens\nresult in rich visual relationships. Existing works take all visual\nrelationships into account for answer prediction. 
However, there are three\nobservations: (1) a single subject in the images can be easily detected as\nmultiple objects with distinct bounding boxes (considered repetitive objects).\nThe associations between these repetitive objects are superfluous for answer\nreasoning; (2) two spatially distant OCR tokens detected in the image\nfrequently have weak semantic dependencies for answer reasoning; and (3) the\nco-existence of nearby objects and tokens may be indicative of important visual\ncues for predicting answers. Rather than utilizing all of them for answer\nprediction, we make an effort to identify the most important connections or\neliminate redundant ones. We propose a sparse spatial graph network (SSGN) that\nintroduces a spatially aware relation pruning technique to this task. As\nspatial factors for relation measurement, we employ spatial distance, geometric\ndimension, overlap area, and DIoU for spatially aware pruning. We consider\nthree visual relationships for graph learning: object-object, OCR-OCR tokens,\nand object-OCR token relationships. SSGN is a progressive graph learning\narchitecture that verifies the pivotal relations in the correlated object-token\nsparse graph, and then in the respective object-based sparse graph and\ntoken-based sparse graph. Experiment results on TextVQA and ST-VQA datasets\ndemonstrate that SSGN achieves promising performances. And some visualization\nresults further demonstrate the interpretability of our method.\n","authors":["Sheng Zhou","Dan Guo","Jia Li","Xun Yang","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2310.09147v1.pdf","comment":"Accepted by TIP 2023"},{"id":"http://arxiv.org/abs/2310.05664v2","updated":"2023-10-13T14:36:36Z","published":"2023-10-09T12:31:30Z","title":"ViTs are Everywhere: A Comprehensive Study Showcasing Vision\n Transformers in Different Domain","summary":" Transformer design is the de facto standard for natural language processing\ntasks. The success of the transformer design in natural language processing has\nlately piqued the interest of researchers in the domain of computer vision.\nWhen compared to Convolutional Neural Networks (CNNs), Vision Transformers\n(ViTs) are becoming more popular and dominant solutions for many vision\nproblems. Transformer-based models outperform other types of networks, such as\nconvolutional and recurrent neural networks, in a range of visual benchmarks.\nWe evaluate various vision transformer models in this work by dividing them\ninto distinct jobs and examining their benefits and drawbacks. ViTs can\novercome several possible difficulties with convolutional neural networks\n(CNNs). The goal of this survey is to show the first use of ViTs in CV. In the\nfirst phase, we categorize various CV applications where ViTs are appropriate.\nImage classification, object identification, image segmentation, video\ntransformer, image denoising, and NAS are all CV applications. Our next step\nwill be to analyze the state-of-the-art in each area and identify the models\nthat are currently available. In addition, we outline numerous open research\ndifficulties as well as prospective research possibilities.\n","authors":["Md Sohag Mia","Abu Bakor Hayat Arnob","Abdu Naim","Abdullah Al Bary Voban","Md Shariful Islam"],"pdf_url":"https://arxiv.org/pdf/2310.05664v2.pdf","comment":"ICCD-2023. 
arXiv admin note: substantial text overlap with\n arXiv:2208.04309 by other authors"},{"id":"http://arxiv.org/abs/2310.09126v1","updated":"2023-10-13T14:14:43Z","published":"2023-10-13T14:14:43Z","title":"Physics-guided Noise Neural Proxy for Low-light Raw Image Denoising","summary":" Low-light raw image denoising plays a crucial role in mobile photography, and\nlearning-based methods have become the mainstream approach. Training the\nlearning-based methods with synthetic data emerges as an efficient and\npractical alternative to paired real data. However, the quality of synthetic\ndata is inherently limited by the low accuracy of the noise model, which\ndecreases the performance of low-light raw image denoising. In this paper, we\ndevelop a novel framework for accurate noise modeling that learns a\nphysics-guided noise neural proxy (PNNP) from dark frames. PNNP integrates\nthree efficient techniques: physics-guided noise decoupling (PND),\nphysics-guided proxy model (PPM), and differentiable distribution-oriented loss\n(DDL). The PND decouples the dark frame into different components and handles\ndifferent levels of noise in a flexible manner, which reduces the complexity of\nthe noise neural proxy. The PPM incorporates physical priors to effectively\nconstrain the generated noise, which promotes the accuracy of the noise neural\nproxy. The DDL provides explicit and reliable supervision for noise modeling,\nwhich promotes the precision of the noise neural proxy. Extensive experiments\non public low-light raw image denoising datasets and real low-light imaging\nscenarios demonstrate the superior performance of our PNNP framework.\n","authors":["Hansen Feng","Lizhi Wang","Yiqi Huang","Yuzhi Wang","Hua Huang"],"pdf_url":"https://arxiv.org/pdf/2310.09126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09125v1","updated":"2023-10-13T14:14:00Z","published":"2023-10-13T14:14:00Z","title":"Training and Predicting Visual Error for Real-Time Applications","summary":" Visual error metrics play a fundamental role in the quantification of\nperceived image similarity. Most recently, use cases for them in real-time\napplications have emerged, such as content-adaptive shading and shading reuse\nto increase performance and improve efficiency. A wide range of different\nmetrics has been established, with the most sophisticated being capable of\ncapturing the perceptual characteristics of the human visual system. However,\ntheir complexity, computational expense, and reliance on reference images to\ncompare against prevent their generalized use in real-time, restricting such\napplications to using only the simplest available metrics. In this work, we\nexplore the abilities of convolutional neural networks to predict a variety of\nvisual metrics without requiring either reference or rendered images.\nSpecifically, we train and deploy a neural network to estimate the visual error\nresulting from reusing shading or using reduced shading rates. The resulting\nmodels account for 70%-90% of the variance while achieving up to an order of\nmagnitude faster computation times. Our solution combines image-space\ninformation that is readily available in most state-of-the-art deferred shading\npipelines with reprojection from previous frames to enable an adequate estimate\nof visual errors, even in previously unseen regions. We describe a suitable\nconvolutional network architecture and considerations for data preparation for\ntraining. 
We demonstrate the capability of our network to predict complex error\nmetrics at interactive rates in a real-time application that implements\ncontent-adaptive shading in a deferred pipeline. Depending on the portion of\nunseen image regions, our approach can achieve up to $2\\times$ performance\ncompared to state-of-the-art methods.\n","authors":["João Libório Cardoso","Bernhard Kerbl","Lei Yang","Yury Uralsky","Michael Wimmer"],"pdf_url":"https://arxiv.org/pdf/2310.09125v1.pdf","comment":"Published at Proceedings of the ACM in Computer Graphics and\n Interactive Techniques. 14 Pages, 16 Figures, 3 Tables. For paper website and\n higher quality figures, see https://jaliborc.github.io/rt-percept/"},{"id":"http://arxiv.org/abs/2310.09122v1","updated":"2023-10-13T14:11:33Z","published":"2023-10-13T14:11:33Z","title":"Equirectangular image construction method for standard CNNs for Semantic\n Segmentation","summary":" 360{\\deg} spherical images have the advantage of a wide field of view, and are\ntypically projected onto a plane for processing, which is known as an\nequirectangular image. The object shape in equirectangular images can be\ndistorted and lack translation invariance. In addition, there are few publicly\navailable datasets of equirectangular images with labels, which presents a challenge for\nstandard CNN models to process equirectangular images effectively. To tackle\nthis problem, we propose a methodology for converting a perspective image into an\nequirectangular image. The inverse transformations of the spherical center\nprojection and the equidistant cylindrical projection are employed. This\nenables the standard CNNs to learn the distortion features at different\npositions in the equirectangular image and thereby gain the ability to\nsemantically segment the equirectangular image. The parameter, {\\phi}, which determines\nthe projection position of the perspective image, has been analyzed using\nvarious datasets and models, such as UNet, UNet++, SegNet, PSPNet, and DeepLab\nv3+. The experiments demonstrate that an optimal value of {\\phi} for effective\nsemantic segmentation of equirectangular images is 6{\\pi}/16 for standard CNNs.\nCompared with the other three types of methods (supervised learning,\nunsupervised learning and data augmentation), the method proposed in this paper\nhas the best average IoU value of 43.76%. This value is 23.85%, 10.7% and\n17.23% higher than those of the other three methods, respectively.\n","authors":["Haoqian Chen","Jian Liu","Minghe Li","Kaiwen Jiang","Ziheng Xu","Rencheng Sun","Yi Sui"],"pdf_url":"https://arxiv.org/pdf/2310.09122v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.07273v6","updated":"2023-10-13T14:10:29Z","published":"2022-11-14T11:07:18Z","title":"MLIC: Multi-Reference Entropy Model for Learned Image Compression","summary":" Recently, learned image compression has achieved remarkable performance. The\nentropy model, which estimates the distribution of the latent representation,\nplays a crucial role in boosting rate-distortion performance. However, most\nentropy models only capture correlations in one dimension, while the latent\nrepresentation contains channel-wise, local spatial, and global spatial\ncorrelations. To tackle this issue, we propose the Multi-Reference Entropy\nModel (MEM) and the advanced version, MEM$^+$. These models capture the\ndifferent types of correlations present in the latent representation. Specifically,\nwe first divide the latent representation into slices. 
When decoding the\ncurrent slice, we use previously decoded slices as context and employ the\nattention map of the previously decoded slice to predict global correlations in\nthe current slice. To capture local contexts, we introduce two enhanced\ncheckerboard context capturing techniques that avoids performance degradation.\nBased on MEM and MEM$^+$, we propose image compression models MLIC and\nMLIC$^+$. Extensive experimental evaluations demonstrate that our MLIC and\nMLIC$^+$ models achieve state-of-the-art performance, reducing BD-rate by\n$8.05\\%$ and $11.39\\%$ on the Kodak dataset compared to VTM-17.0 when measured\nin PSNR. Our code will be available at https://github.com/JiangWeibeta/MLIC.\n","authors":["Wei Jiang","Jiayu Yang","Yongqi Zhai","Peirong Ning","Feng Gao","Ronggang Wang"],"pdf_url":"https://arxiv.org/pdf/2211.07273v6.pdf","comment":"Accepted at ACMMM 2023"},{"id":"http://arxiv.org/abs/2309.09379v2","updated":"2023-10-13T14:04:54Z","published":"2023-09-17T21:06:22Z","title":"A Critical Analysis of Internal Reliability for Uncertainty\n Quantification of Dense Image Matching in Multi-view Stereo","summary":" Nowadays, photogrammetrically derived point clouds are widely used in many\ncivilian applications due to their low cost and flexibility in acquisition.\nTypically, photogrammetric point clouds are assessed through reference data\nsuch as LiDAR point clouds. However, when reference data are not available, the\nassessment of photogrammetric point clouds may be challenging. Since these\npoint clouds are algorithmically derived, their accuracies and precisions are\nhighly varying with the camera networks, scene complexity, and dense image\nmatching (DIM) algorithms, and there is no standard error metric to determine\nper-point errors. The theory of internal reliability of camera networks has\nbeen well studied through first-order error estimation of Bundle Adjustment\n(BA), which is used to understand the errors of 3D points assuming known\nmeasurement errors. However, the measurement errors of the DIM algorithms are\nintricate to an extent that every single point may have its error function\ndetermined by factors such as pixel intensity, texture entropy, and surface\nsmoothness. Despite the complexity, there exist a few common metrics that may\naid the process of estimating the posterior reliability of the derived points,\nespecially in a multi-view stereo (MVS) setup when redundancies are present. In\nthis paper, by using an aerial oblique photogrammetric block with LiDAR\nreference data, we analyze several internal matching metrics within a common\nMVS framework, including statistics in ray convergence, intersection angles,\nDIM energy, etc.\n","authors":["Debao Huang","Rongjun Qin"],"pdf_url":"https://arxiv.org/pdf/2309.09379v2.pdf","comment":"Figure 8"},{"id":"http://arxiv.org/abs/2310.09118v1","updated":"2023-10-13T14:03:01Z","published":"2023-10-13T14:03:01Z","title":"DSG: An End-to-End Document Structure Generator","summary":" Information in industry, research, and the public sector is widely stored as\nrendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks,\nsystems are needed that map rendered documents onto a structured hierarchical\nformat. However, existing systems for this task are limited by heuristics and\nare not end-to-end trainable. In this work, we introduce the Document Structure\nGenerator (DSG), a novel system for document parsing that is fully end-to-end\ntrainable. 
DSG combines a deep neural network for parsing (i) entities in\ndocuments (e.g., figures, text blocks, headers, etc.) and (ii) relations that\ncapture the sequence and nested structure between entities. Unlike existing\nsystems that rely on heuristics, our DSG is trained end-to-end, making it\neffective and flexible for real-world applications. We further contribute a\nnew, large-scale dataset called E-Periodica comprising real-world magazines\nwith complex document structures for evaluation. Our results demonstrate that\nour DSG outperforms commercial OCR tools and, on top of that, achieves\nstate-of-the-art performance. To the best of our knowledge, our DSG system is\nthe first end-to-end trainable system for hierarchical document parsing.\n","authors":["Johannes Rausch","Gentiana Rashiti","Maxim Gusev","Ce Zhang","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2310.09118v1.pdf","comment":"Accepted at ICDM 2023"},{"id":"http://arxiv.org/abs/2310.09114v1","updated":"2023-10-13T14:00:49Z","published":"2023-10-13T14:00:49Z","title":"Timestamp-supervised Wearable-based Activity Segmentation and\n Recognition with Contrastive Learning and Order-Preserving Optimal Transport","summary":" Human activity recognition (HAR) with wearables is one of the serviceable\ntechnologies in ubiquitous and mobile computing applications. The\nsliding-window scheme is widely adopted while suffering from the multi-class\nwindows problem. As a result, there is a growing focus on joint segmentation\nand recognition with deep-learning methods, aiming at simultaneously dealing\nwith HAR and time-series segmentation issues. However, obtaining the full\nactivity annotations of wearable data sequences is resource-intensive or\ntime-consuming, while unsupervised methods yield poor performance. To address\nthese challenges, we propose a novel method for joint activity segmentation and\nrecognition with timestamp supervision, in which only a single annotated sample\nis needed in each activity segment. However, the limited information of sparse\nannotations exacerbates the gap between recognition and segmentation tasks,\nleading to sub-optimal model performance. Therefore, the prototypes are\nestimated by class-activation maps to form a sample-to-prototype contrast\nmodule for well-structured embeddings. Moreover, with the optimal transport\ntheory, our approach generates the sample-level pseudo-labels that take\nadvantage of unlabeled data between timestamp annotations for further\nperformance improvement. Comprehensive experiments on four public HAR datasets\ndemonstrate that our model trained with timestamp supervision is superior to\nthe state-of-the-art weakly-supervised methods and achieves comparable\nperformance to the fully-supervised approaches.\n","authors":["Songpengcheng Xia","Lei Chu","Ling Pei","Jiarui Yang","Wenxian Yu","Robert C. Qiu"],"pdf_url":"https://arxiv.org/pdf/2310.09114v1.pdf","comment":"Under Review (submitted to IEEE TMC)"},{"id":"http://arxiv.org/abs/2211.08007v2","updated":"2023-10-13T14:00:48Z","published":"2022-11-15T09:42:07Z","title":"Uncertainty-aware Gait Recognition via Learning from Dirichlet\n Distribution-based Evidence","summary":" Existing gait recognition frameworks retrieve an identity in the gallery\nbased on the distance between a probe sample and the identities in the gallery.\nHowever, existing methods often neglect that the gallery may not contain\nidentities corresponding to the probes, leading to recognition errors rather\nthan raising an alarm. 
In this paper, we introduce a novel uncertainty-aware\ngait recognition method that models the uncertainty of identification based on\nlearned evidence. Specifically, we treat our recognition model as an evidence\ncollector to gather evidence from input samples and parameterize a Dirichlet\ndistribution over the evidence. The Dirichlet distribution essentially\nrepresents the density of the probability assigned to the input samples. We\nutilize the distribution to evaluate the resultant uncertainty of each probe\nsample and then determine whether a probe has a counterpart in the gallery or\nnot. To the best of our knowledge, our method is the first attempt to tackle\ngait recognition with uncertainty modelling. Moreover, our uncertainty modeling\nsignificantly improves the robustness against out-of-distribution (OOD)\nqueries. Extensive experiments demonstrate that our method achieves\nstate-of-the-art performance on datasets with OOD queries, and can also\ngeneralize well to other identity-retrieval tasks. Importantly, our method\noutperforms the state-of-the-art by a large margin of 51.26% when the OOD query\nrate is around 50% on OUMVLP.\n","authors":["Beibei Lin","Chen Liu","Ming Wang","Lincheng Li","Shunli Zhang","Robby T. Tan","Xin Yu"],"pdf_url":"https://arxiv.org/pdf/2211.08007v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09675v3","updated":"2023-10-13T13:48:58Z","published":"2023-06-16T08:13:41Z","title":"Multi-View Class Incremental Learning","summary":" Multi-view learning (MVL) has gained great success in integrating information\nfrom multiple perspectives of a dataset to improve downstream task performance.\nTo make MVL methods more practical in an open-ended environment, this paper\ninvestigates a novel paradigm called multi-view class incremental learning\n(MVCIL), where a single model incrementally classifies new classes from a\ncontinual stream of views, requiring no access to earlier views of data.\nHowever, MVCIL is challenged by the catastrophic forgetting of old information\nand the interference with learning new concepts. To address this, we first\ndevelop a randomization-based representation learning technique for\nfeature extraction to guarantee their separate view-optimal working states,\nduring which multiple views belonging to a class are presented sequentially;\nthen, we integrate them one by one in the orthogonality fusion subspace spanned\nby the extracted features; finally, we introduce selective weight consolidation\nfor learning-without-forgetting decision-making while encountering new classes.\nExtensive experiments on synthetic and real-world datasets validate the\neffectiveness of our approach.\n","authors":["Depeng Li","Tianqi Wang","Junwei Chen","Kenji Kawaguchi","Cheng Lian","Zhigang Zeng"],"pdf_url":"https://arxiv.org/pdf/2306.09675v3.pdf","comment":"Accepted to Information Fusion"},{"id":"http://arxiv.org/abs/2310.02071v2","updated":"2023-10-13T13:43:53Z","published":"2023-10-03T14:13:36Z","title":"Towards End-to-End Embodied Decision Making via Multi-modal Large\n Language Model: Explorations with GPT4-Vision and Beyond","summary":" In this study, we explore the potential of Multimodal Large Language Models\n(MLLMs) in improving embodied decision-making processes for agents. While Large\nLanguage Models (LLMs) have been widely used due to their advanced reasoning\nskills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual\nunderstanding and reasoning capabilities. 
We investigate whether\nstate-of-the-art MLLMs can handle embodied decision-making in an end-to-end\nmanner and whether collaborations between LLMs and MLLMs can enhance\ndecision-making. To address these questions, we introduce a new benchmark\ncalled PCA-EVAL, which evaluates embodied decision-making from the perspectives\nof Perception, Cognition, and Action. Additionally, we propose HOLMES, a\nmulti-agent cooperation framework that allows LLMs to leverage MLLMs and APIs\nto gather multimodal information for informed decision-making. We compare\nend-to-end embodied decision-making and HOLMES on our benchmark and find that\nthe GPT4-Vision model demonstrates strong end-to-end embodied decision-making\nabilities, outperforming GPT4-HOLMES in terms of average decision accuracy\n(+3%). However, this performance is exclusive to the latest GPT4-Vision model,\nsurpassing the open-source state-of-the-art MLLM by 26%. Our results indicate\nthat powerful MLLMs like GPT4-Vision hold promise for decision-making in\nembodied agents, offering new avenues for MLLM research. Code and data are open\nat https://github.com/pkunlp-icler/PCA-EVAL/.\n","authors":["Liang Chen","Yichi Zhang","Shuhuai Ren","Haozhe Zhao","Zefan Cai","Yuchi Wang","Peiyi Wang","Tianyu Liu","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2310.02071v2.pdf","comment":"18 pages, 10 figures, Code and data:\n https://github.com/pkunlp-icler/PCA-EVAL/"},{"id":"http://arxiv.org/abs/2304.03550v2","updated":"2023-10-13T13:38:26Z","published":"2023-04-07T09:11:29Z","title":"Hierarchical Disentanglement-Alignment Network for Robust SAR Vehicle\n Recognition","summary":" Vehicle recognition is a fundamental problem in SAR image interpretation.\nHowever, robustly recognizing vehicle targets is a challenging task in SAR due\nto the large intraclass variations and small interclass variations.\nAdditionally, the lack of large datasets further complicates the task. Inspired\nby the analysis of target signature variations and deep learning\nexplainability, this paper proposes a novel domain alignment framework named\nthe Hierarchical Disentanglement-Alignment Network (HDANet) to achieve\nrobustness under various operating conditions. Concisely, HDANet integrates\nfeature disentanglement and alignment into a unified framework with three\nmodules: domain data generation, multitask-assisted mask disentanglement, and\ndomain alignment of target features. The first module generates diverse data\nfor alignment, and three simple but effective data augmentation methods are\ndesigned to simulate target signature variations. The second module\ndisentangles the target features from background clutter using the\nmultitask-assisted mask to prevent clutter from interfering with subsequent\nalignment. The third module employs a contrastive loss for domain alignment to\nextract robust target features from generated diverse data and disentangled\nfeatures. 
Lastly, the proposed method demonstrates impressive robustness across\nnine operating conditions in the MSTAR dataset, and extensive qualitative and\nquantitative analyses validate the effectiveness of our framework.\n","authors":["Weijie Li","Wei Yang","Wenpeng Zhang","Tianpeng Liu","Yongxiang Liu","Li Liu"],"pdf_url":"https://arxiv.org/pdf/2304.03550v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08586v2","updated":"2023-10-13T13:37:57Z","published":"2023-10-12T17:59:57Z","title":"PonderV2: Pave the Way for 3D Foundation Model with A Universal\n Pre-training Paradigm","summary":" In contrast to numerous NLP and 2D computer vision foundational models, the\nlearning of a robust and highly generalized 3D foundational model poses\nconsiderably greater challenges. This is primarily due to the inherent data\nvariability and the diversity of downstream tasks. In this paper, we introduce\na comprehensive 3D pre-training framework designed to facilitate the\nacquisition of efficient 3D representations, thereby establishing a pathway to\n3D foundational models. Motivated by the fact that informative 3D features\nshould be able to encode rich geometry and appearance cues that can be utilized\nto render realistic images, we propose a novel universal paradigm to learn\npoint cloud representations by differentiable neural rendering, serving as a\nbridge between 3D and 2D worlds. We train a point cloud encoder within a\ndevised volumetric neural renderer by comparing the rendered images with the\nreal images. Notably, our approach demonstrates the seamless integration of the\nlearned 3D encoder into diverse downstream tasks. These tasks encompass not\nonly high-level challenges such as 3D detection and segmentation but also\nlow-level objectives like 3D reconstruction and image synthesis, spanning both\nindoor and outdoor scenarios. Besides, we also illustrate the capability of\npre-training a 2D backbone using the proposed universal methodology, surpassing\nconventional pre-training methods by a large margin. For the first time,\nPonderV2 achieves state-of-the-art performance on 11 indoor and outdoor\nbenchmarks. The consistent improvements in various settings imply the\neffectiveness of the proposed method. Code and models will be made available at\nhttps://github.com/OpenGVLab/PonderV2.\n","authors":["Haoyi Zhu","Honghui Yang","Xiaoyang Wu","Di Huang","Sha Zhang","Xianglong He","Tong He","Hengshuang Zhao","Chunhua Shen","Yu Qiao","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.08586v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2301.00157"},{"id":"http://arxiv.org/abs/2310.09099v1","updated":"2023-10-13T13:35:19Z","published":"2023-10-13T13:35:19Z","title":"Faster 3D cardiac CT segmentation with Vision Transformers","summary":" Accurate segmentation of the heart is essential for personalized blood flow\nsimulations and surgical intervention planning. A recent advancement in image\nrecognition is the Vision Transformer (ViT), which expands the field of view to\nencompass a greater portion of the global image context. We adapted ViT for\nthree-dimensional volume inputs. Cardiac computed tomography (CT) volumes from\n39 patients, featuring up to 20 timepoints representing the complete cardiac\ncycle, were utilized. 
Our network incorporates a modified ResNet50 block as\nwell as a ViT block and employs cascade upsampling with skip connections.\nDespite its increased model complexity, our hybrid Transformer-Residual U-Net\nframework, termed TRUNet, converges in significantly less time than residual\nU-Net while providing comparable or superior segmentations of the left\nventricle, left atrium, left atrial appendage, ascending aorta, and pulmonary\nveins. TRUNet offers more precise vessel boundary segmentation and better\ncaptures the heart's overall anatomical structure compared to residual U-Net,\nas confirmed by the absence of extraneous clusters of missegmented voxels. In\nterms of both performance and training speed, TRUNet exceeded U-Net, a commonly\nused segmentation architecture, making it a promising tool for 3D semantic\nsegmentation tasks in medical imaging. The code for TRUNet is available at\ngithub.com/ljollans/TRUNet.\n","authors":["Lee Jollans","Mariana Bustamante","Lilian Henriksson","Anders Persson","Tino Ebbers"],"pdf_url":"https://arxiv.org/pdf/2310.09099v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09092v1","updated":"2023-10-13T13:24:37Z","published":"2023-10-13T13:24:37Z","title":"iPUNet:Iterative Cross Field Guided Point Cloud Upsampling","summary":" Point clouds acquired by 3D scanning devices are often sparse, noisy, and\nnon-uniform, causing a loss of geometric features. To facilitate the usability\nof point clouds in downstream applications, given such input, we present a\nlearning-based point upsampling method, i.e., iPUNet, which generates dense and\nuniform points at arbitrary ratios and better captures sharp features. To\ngenerate feature-aware points, we introduce cross fields that are aligned to\nsharp geometric features by self-supervision to guide point generation. Given\ncross field defined frames, we enable arbitrary ratio upsampling by learning at\neach input point a local parameterized surface. The learned surface consumes\nthe neighboring points and 2D tangent plane coordinates as input, and maps onto\na continuous surface in 3D where arbitrary ratios of output points can be\nsampled. To solve the non-uniformity of input points, on top of the cross field\nguided upsampling, we further introduce an iterative strategy that refines the\npoint distribution by moving sparse points onto the desired continuous 3D\nsurface in each iteration. Within only a few iterations, the sparse points are\nevenly distributed and their corresponding dense samples are more uniform and\nbetter capture geometric features. Through extensive evaluations on diverse\nscans of objects and scenes, we demonstrate that iPUNet is robust to handle\nnoisy and non-uniformly distributed inputs, and outperforms state-of-the-art\npoint cloud upsampling methods.\n","authors":["Guangshun Wei","Hao Pan","Shaojie Zhuang","Yuanfeng Zhou","Changjian Li"],"pdf_url":"https://arxiv.org/pdf/2310.09092v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10474v2","updated":"2023-10-13T13:05:26Z","published":"2023-06-18T04:34:17Z","title":"A Universal Semantic-Geometric Representation for Robotic Manipulation","summary":" Robots rely heavily on sensors, especially RGB and depth cameras, to perceive\nand interact with the world. RGB cameras record 2D images with rich semantic\ninformation while missing precise spatial information. 
On the other side, depth\ncameras offer critical 3D geometry data but capture limited semantics.\nTherefore, integrating both modalities is crucial for learning representations\nfor robotic perception and control. However, current research predominantly\nfocuses on only one of these modalities, neglecting the benefits of\nincorporating both. To this end, we present $\\textbf{Semantic-Geometric\nRepresentation} (\\textbf{SGR})$, a universal perception module for robotics\nthat leverages the rich semantic information of large-scale pre-trained 2D\nmodels and inherits the merits of 3D spatial reasoning. Our experiments\ndemonstrate that SGR empowers the agent to successfully complete a diverse\nrange of simulated and real-world robotic manipulation tasks, outperforming\nstate-of-the-art methods significantly in both single-task and multi-task\nsettings. Furthermore, SGR possesses the capability to generalize to novel\nsemantic attributes, setting it apart from the other methods. Project website:\nhttps://semantic-geometric-representation.github.io.\n","authors":["Tong Zhang","Yingdong Hu","Hanchen Cui","Hang Zhao","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2306.10474v2.pdf","comment":"CoRL 2023. Project website:\n https://semantic-geometric-representation.github.io"},{"id":"http://arxiv.org/abs/2304.12372v3","updated":"2023-10-13T12:58:41Z","published":"2023-04-24T18:10:25Z","title":"Beyond the Pixel: a Photometrically Calibrated HDR Dataset for Luminance\n and Color Prediction","summary":" Light plays an important role in human well-being. However, most computer\nvision tasks treat pixels without considering their relationship to physical\nluminance. To address this shortcoming, we introduce the Laval Photometric\nIndoor HDR Dataset, the first large-scale photometrically calibrated dataset of\nhigh dynamic range 360{\\deg} panoramas. Our key contribution is the calibration\nof an existing, uncalibrated HDR Dataset. We do so by accurately capturing RAW\nbracketed exposures simultaneously with a professional photometric measurement\ndevice (chroma meter) for multiple scenes across a variety of lighting\nconditions. Using the resulting measurements, we establish the calibration\ncoefficients to be applied to the HDR images. The resulting dataset is a rich\nrepresentation of indoor scenes which displays a wide range of illuminance and\ncolor, and varied types of light sources. We exploit the dataset to introduce\nthree novel tasks, where: per-pixel luminance, per-pixel color and planar\nilluminance can be predicted from a single input image. Finally, we also\ncapture another smaller photometric dataset with a commercial 360{\\deg} camera,\nto experiment on generalization across cameras. We are optimistic that the\nrelease of our datasets and associated code will spark interest in physically\naccurate light estimation within the community. Dataset and code are available\nat https://lvsn.github.io/beyondthepixel/.\n","authors":["Christophe Bolduc","Justine Giroux","Marc Hébert","Claude Demers","Jean-François Lalonde"],"pdf_url":"https://arxiv.org/pdf/2304.12372v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.15833v2","updated":"2023-10-13T12:49:35Z","published":"2023-03-28T09:05:15Z","title":"Complementary Domain Adaptation and Generalization for Unsupervised\n Continual Domain Shift Learning","summary":" Continual domain shift poses a significant challenge in real-world\napplications, particularly in situations where labeled data is not available\nfor new domains. 
The challenge of acquiring knowledge in this problem setting\nis referred to as unsupervised continual domain shift learning. Existing\nmethods for domain adaptation and generalization have limitations in addressing\nthis issue, as they focus either on adapting to a specific domain or\ngeneralizing to unseen domains, but not both. In this paper, we propose\nComplementary Domain Adaptation and Generalization (CoDAG), a simple yet\neffective learning framework that combines domain adaptation and generalization\nin a complementary manner to achieve three major goals of unsupervised\ncontinual domain shift learning: adapting to a current domain, generalizing to\nunseen domains, and preventing forgetting of previously seen domains. Our\napproach is model-agnostic, meaning that it is compatible with any existing\ndomain adaptation and generalization algorithms. We evaluate CoDAG on several\nbenchmark datasets and demonstrate that our model outperforms state-of-the-art\nmodels in all datasets and evaluation metrics, highlighting its effectiveness\nand robustness in handling unsupervised continual domain shift learning.\n","authors":["Wonguk Cho","Jinha Park","Taesup Kim"],"pdf_url":"https://arxiv.org/pdf/2303.15833v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2304.06385v4","updated":"2023-10-13T12:43:31Z","published":"2023-04-13T10:37:41Z","title":"TransHP: Image Classification with Hierarchical Prompting","summary":" This paper explores a hierarchical prompting mechanism for the hierarchical\nimage classification (HIC) task. Different from prior HIC methods, our\nhierarchical prompting is the first to explicitly inject ancestor-class\ninformation as a tokenized hint that benefits the descendant-class\ndiscrimination. We think it well imitates human visual recognition, i.e.,\nhumans may use the ancestor class as a prompt to draw focus on the subtle\ndifferences among descendant classes. We model this prompting mechanism into a\nTransformer with Hierarchical Prompting (TransHP). TransHP consists of three\nsteps: 1) learning a set of prompt tokens to represent the coarse (ancestor)\nclasses, 2) on-the-fly predicting the coarse class of the input image at an\nintermediate block, and 3) injecting the prompt token of the predicted coarse\nclass into the intermediate feature. Though the parameters of TransHP maintain\nthe same for all input images, the injected coarse-class prompt conditions\n(modifies) the subsequent feature extraction and encourages a dynamic focus on\nrelatively subtle differences among the descendant classes. Extensive\nexperiments show that TransHP improves image classification on accuracy (e.g.,\nimproving ViT-B/16 by +2.83% ImageNet classification accuracy), training data\nefficiency (e.g., +12.69% improvement under 10% ImageNet training data), and\nmodel explainability. Moreover, TransHP also performs favorably against prior\nHIC methods, showing that TransHP well exploits the hierarchical information.\n","authors":["Wenhao Wang","Yifan Sun","Wei Li","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2304.06385v4.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.09066v1","updated":"2023-10-13T12:41:28Z","published":"2023-10-13T12:41:28Z","title":"pose-format: Library for Viewing, Augmenting, and Handling .pose Files","summary":" Managing and analyzing pose data is a complex task, with challenges ranging\nfrom handling diverse file structures and data types to facilitating effective\ndata manipulations such as normalization and augmentation. 
This paper presents\n\\texttt{pose-format}, a comprehensive toolkit designed to address these\nchallenges by providing a unified, flexible, and easy-to-use interface. The\nlibrary includes a specialized file format that encapsulates various types of\npose data, accommodating multiple individuals and an indefinite number of time\nframes, thus proving its utility for both image and video data. Furthermore, it\noffers seamless integration with popular numerical libraries such as NumPy,\nPyTorch, and TensorFlow, thereby enabling robust machine-learning applications.\nThrough benchmarking, we demonstrate that our \\texttt{.pose} file format offers\nvastly superior performance against prevalent formats like OpenPose, with added\nadvantages like self-contained pose specification. Additionally, the library\nincludes features for data normalization, augmentation, and easy-to-use\nvisualization capabilities, both in Python and Browser environments.\n\\texttt{pose-format} emerges as a one-stop solution, streamlining the\ncomplexities of pose data management and analysis.\n","authors":["Amit Moryossef","Mathias Müller","Rebecka Fahrni"],"pdf_url":"https://arxiv.org/pdf/2310.09066v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09016v1","updated":"2023-10-13T11:25:41Z","published":"2023-10-13T11:25:41Z","title":"A Spatial-Temporal Dual-Mode Mixed Flow Network for Panoramic Video\n Salient Object Detection","summary":" Salient object detection (SOD) in panoramic video is still in the initial\nexploration stage. The indirect application of 2D video SOD method to the\ndetection of salient objects in panoramic video has many unmet challenges, such\nas low detection accuracy, high model complexity, and poor generalization\nperformance. To overcome these hurdles, we design an Inter-Layer Attention\n(ILA) module, an Inter-Layer weight (ILW) module, and a Bi-Modal Attention\n(BMA) module. Based on these modules, we propose a Spatial-Temporal Dual-Mode\nMixed Flow Network (STDMMF-Net) that exploits the spatial flow of panoramic\nvideo and the corresponding optical flow for SOD. First, the ILA module\ncalculates the attention between adjacent level features of consecutive frames\nof panoramic video to improve the accuracy of extracting salient object\nfeatures from the spatial flow. Then, the ILW module quantifies the salient\nobject information contained in the features of each level to improve the\nfusion efficiency of the features of each level in the mixed flow. Finally, the\nBMA module improves the detection accuracy of STDMMF-Net. A large number of\nsubjective and objective experimental results testify that the proposed method\ndemonstrates better detection accuracy than the state-of-the-art (SOTA)\nmethods. Moreover, the comprehensive performance of the proposed method is\nbetter in terms of memory required for model inference, testing time,\ncomplexity, and generalization performance.\n","authors":["Xiaolei Chen","Pengcheng Zhang","Zelong Du","Ishfaq Ahmad"],"pdf_url":"https://arxiv.org/pdf/2310.09016v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.05000v2","updated":"2023-10-13T11:07:09Z","published":"2023-07-11T03:40:10Z","title":"Neural Point-based Volumetric Avatar: Surface-guided Neural Points for\n Efficient and Photorealistic Volumetric Head Avatar","summary":" Rendering photorealistic and dynamically moving human heads is crucial for\nensuring a pleasant and immersive experience in AR/VR and video conferencing\napplications. 
However, existing methods often struggle to model challenging\nfacial regions (e.g., mouth interior, eyes, hair/beard), resulting in\nunrealistic and blurry results. In this paper, we propose {\\fullname}\n({\\name}), a method that adopts the neural point representation as well as the\nneural volume rendering process and discards the predefined connectivity and\nhard correspondence imposed by mesh-based approaches. Specifically, the neural\npoints are strategically constrained around the surface of the target\nexpression via a high-resolution UV displacement map, achieving increased\nmodeling capacity and more accurate control. We introduce three technical\ninnovations to improve the rendering and training efficiency: a patch-wise\ndepth-guided (shading point) sampling strategy, a lightweight radiance decoding\nprocess, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By\ndesign, our {\\name} is better equipped to handle topologically changing regions\nand thin structures while also ensuring accurate expression control when\nanimating avatars. Experiments conducted on three subjects from the Multiface\ndataset demonstrate the effectiveness of our designs, outperforming previous\nstate-of-the-art methods, especially in handling challenging facial regions.\n","authors":["Cong Wang","Di Kang","Yan-Pei Cao","Linchao Bao","Ying Shan","Song-Hai Zhang"],"pdf_url":"https://arxiv.org/pdf/2307.05000v2.pdf","comment":"Accepted by SIGGRAPH Asia 2023"},{"id":"http://arxiv.org/abs/2309.14976v3","updated":"2023-10-13T10:36:14Z","published":"2023-09-26T14:52:51Z","title":"MoCaE: Mixture of Calibrated Experts Significantly Improves Object\n Detection","summary":" We propose an extremely simple and highly effective approach to faithfully\ncombine different object detectors to obtain a Mixture of Experts (MoE) that\nhas a superior accuracy to the individual experts in the mixture. We find that\nnaively combining these experts in a similar way to the well-known Deep\nEnsembles (DEs), does not result in an effective MoE. We identify the\nincompatibility between the confidence score distribution of different\ndetectors to be the primary reason for such failure cases. Therefore, to\nconstruct the MoE, our proposal is to first calibrate each individual detector\nagainst a target calibration function. Then, filter and refine all the\npredictions from different detectors in the mixture. We term this approach as\nMoCaE and demonstrate its effectiveness through extensive experiments on object\ndetection, instance segmentation and rotated object detection tasks.\nSpecifically, MoCaE improves (i) three strong object detectors on COCO test-dev\nby $2.4$ $\\mathrm{AP}$ by reaching $59.0$ $\\mathrm{AP}$; (ii) instance\nsegmentation methods on the challenging long-tailed LVIS dataset by $2.3$\n$\\mathrm{AP}$; and (iii) all existing rotated object detectors by reaching\n$82.62$ $\\mathrm{AP_{50}}$ on DOTA dataset, establishing a new state-of-the-art\n(SOTA). Code will be made public.\n","authors":["Kemal Oksuz","Selim Kuzucu","Tom Joy","Puneet K. 
Dokania"],"pdf_url":"https://arxiv.org/pdf/2309.14976v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08986v1","updated":"2023-10-13T10:06:39Z","published":"2023-10-13T10:06:39Z","title":"VCL Challenges 2023 at ICCV 2023 Technical Report: Bi-level Adaptation\n Method for Test-time Adaptive Object Detection","summary":" This report outlines our team's participation in VCL Challenges B Continual\nTest_time Adaptation, focusing on the technical details of our approach. Our\nprimary focus is Testtime Adaptation using bi_level adaptations, encompassing\nimage_level and detector_level adaptations. At the image level, we employ\nadjustable parameterbased image filters, while at the detector level, we\nleverage adjustable parameterbased mean teacher modules. Ultimately, through\nthe utilization of these bi_level adaptations, we have achieved a remarkable\n38.3% mAP on the target domain of the test set within VCL Challenges B. It is\nworth noting that the minimal drop in mAP, is mearly 4.2%, and the overall\nperformance is 32.5% mAP.\n","authors":["Chenyu Lin","Yusheng He","Zhengqing Zang","Chenwei Tang","Tao Wang","Jiancheng Lv"],"pdf_url":"https://arxiv.org/pdf/2310.08986v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08984v1","updated":"2023-10-13T10:03:01Z","published":"2023-10-13T10:03:01Z","title":"UniParser: Multi-Human Parsing with Unified Correlation Representation\n Learning","summary":" Multi-human parsing is an image segmentation task necessitating both\ninstance-level and fine-grained category-level information. However, prior\nresearch has typically processed these two types of information through\nseparate branches and distinct output formats, leading to inefficient and\nredundant frameworks. This paper introduces UniParser, which integrates\ninstance-level and category-level representations in three key aspects: 1) we\npropose a unified correlation representation learning approach, allowing our\nnetwork to learn instance and category features within the cosine space; 2) we\nunify the form of outputs of each modules as pixel-level segmentation results\nwhile supervising instance and category features using a homogeneous label\naccompanied by an auxiliary loss; and 3) we design a joint optimization\nprocedure to fuse instance and category representations. By virtual of unifying\ninstance-level and category-level output, UniParser circumvents manually\ndesigned post-processing techniques and surpasses state-of-the-art methods,\nachieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source\ncode, pretrained models, and online demos to facilitate future studies.\n","authors":["Jiaming Chu","Lei Jin","Junliang Xing","Jian Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08984v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08371v2","updated":"2023-10-13T09:20:09Z","published":"2023-10-12T14:40:24Z","title":"Worst-Case Morphs using Wasserstein ALI and Improved MIPGAN","summary":" A morph is a combination of two separate facial images and contains identity\ninformation of two different people. When used in an identity document, both\npeople can be authenticated by a biometric Face Recognition (FR) system. Morphs\ncan be generated using either a landmark-based approach or approaches based on\ndeep learning such as Generative Adversarial Networks (GAN). In a recent paper,\nwe introduced a \\emph{worst-case} upper bound on how challenging morphing\nattacks can be for an FR system. 
The closer morphs are to this upper bound, the\nbigger the challenge they pose to FR. We introduced an approach with which it\nwas possible to generate morphs that approximate this upper bound for a known\nFR system (white box), but not for unknown (black box) FR systems.\n In this paper, we introduce a morph generation method that can approximate\nworst-case morphs even when the FR system is not known. A key contribution is\nthat we include the goal of generating difficult morphs \\emph{during} training.\nOur method is based on Adversarially Learned Inference (ALI) and uses concepts\nfrom Wasserstein GANs trained with Gradient Penalty, which were introduced to\nstabilise the training of GANs. We include these concepts to achieve similar\nimprovement in training stability and call the resulting method Wasserstein ALI\n(WALI). We finetune WALI using loss functions designed specifically to improve\nthe ability to manipulate identity information in facial images and show how it\ncan generate morphs that are more challenging for FR systems than landmark- or\nGAN-based morphs. We also show how our findings can be used to improve MIPGAN,\nan existing StyleGAN-based morph generator.\n","authors":["Una M. Kelly","Meike Nauta","Lu Liu","Luuk J. Spreeuwers","Raymond N. J. Veldhuis"],"pdf_url":"https://arxiv.org/pdf/2310.08371v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08956v1","updated":"2023-10-13T09:04:52Z","published":"2023-10-13T09:04:52Z","title":"LRRU: Long-short Range Recurrent Updating Networks for Depth Completion","summary":" Existing deep learning-based depth completion methods generally employ\nmassive stacked layers to predict the dense depth map from sparse input data.\nAlthough such approaches greatly advance this task, their accompanied huge\ncomputational complexity hinders their practical applications. To accomplish\ndepth completion more efficiently, we propose a novel lightweight deep network\nframework, the Long-short Range Recurrent Updating (LRRU) network. Without\nlearning complex feature representations, LRRU first roughly fills the sparse\ninput to obtain an initial dense depth map, and then iteratively updates it\nthrough learned spatially-variant kernels. Our iterative update process is\ncontent-adaptive and highly flexible, where the kernel weights are learned by\njointly considering the guidance RGB images and the depth map to be updated,\nand large-to-small kernel scopes are dynamically adjusted to capture\nlong-to-short range dependencies. Our initial depth map has coarse but complete\nscene depth information, which helps relieve the burden of directly regressing\nthe dense depth from sparse ones, while our proposed method can effectively\nrefine it to an accurate depth map with less learnable parameters and inference\ntime. Experimental results demonstrate that our proposed LRRU variants achieve\nstate-of-the-art performance across different parameter regimes. 
In particular,\nthe LRRU-Base model outperforms competing approaches on the NYUv2 dataset, and\nranks 1st on the KITTI depth completion benchmark at the time of submission.\nProject page: https://npucvr.github.io/LRRU/.\n","authors":["Yufei Wang","Bo Li","Ge Zhang","Qi Liu","Tao Gao","Yuchao Dai"],"pdf_url":"https://arxiv.org/pdf/2310.08956v1.pdf","comment":"Published in ICCV 2023"},{"id":"http://arxiv.org/abs/2309.01429v2","updated":"2023-10-13T08:42:01Z","published":"2023-09-04T08:23:31Z","title":"Adapting Segment Anything Model for Change Detection in HR Remote\n Sensing Images","summary":" Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM)\nallow zero-shot or interactive segmentation of visual contents, thus they are\nquickly applied in a variety of visual scenes. However, their direct use in\nmany Remote Sensing (RS) applications is often unsatisfactory due to the\nspecial imaging characteristics of RS images. In this work, we aim to utilize\nthe strong visual recognition capabilities of VFMs to improve the change\ndetection of high-resolution Remote Sensing Images (RSIs). We employ the visual\nencoder of FastSAM, an efficient variant of the SAM, to extract visual\nrepresentations in RS scenes. To adapt FastSAM to focus on some specific ground\nobjects in the RS scenes, we propose a convolutional adaptor to aggregate the\ntask-oriented change information. Moreover, to utilize the semantic\nrepresentations that are inherent to SAM features, we introduce a task-agnostic\nsemantic learning branch to model the semantic latent in bi-temporal RSIs. The\nresulting method, SAMCD, obtains superior accuracy compared to the SOTA methods\nand exhibits a sample-efficient learning ability that is comparable to\nsemi-supervised CD methods. To the best of our knowledge, this is the first\nwork that adapts VFMs for the CD of HR RSIs.\n","authors":["Lei Ding","Kun Zhu","Daifeng Peng","Hao Tang","Kuiwu Yang","Lorenzo Bruzzone"],"pdf_url":"https://arxiv.org/pdf/2309.01429v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08949v1","updated":"2023-10-13T08:38:56Z","published":"2023-10-13T08:38:56Z","title":"Making Multimodal Generation Easier: When Diffusion Models Meet LLMs","summary":" We present EasyGen, an efficient model designed to enhance multimodal\nunderstanding and generation by harnessing the capabilities of diffusion models\nand large language models (LLMs). Unlike existing multimodal models that\npredominately depend on encoders like CLIP or ImageBind and need ample amounts\nof training data to bridge the gap between modalities, EasyGen is built upon a\nbidirectional conditional diffusion model named BiDiffuser, which promotes more\nefficient interactions between modalities. EasyGen handles image-to-text\ngeneration by integrating BiDiffuser and an LLM via a simple projection layer.\nUnlike most existing multimodal models that are limited to generating text\nresponses, EasyGen can also facilitate text-to-image generation by leveraging\nthe LLM to create textual descriptions, which can be interpreted by BiDiffuser\nto generate appropriate visual responses. Extensive quantitative and\nqualitative experiments demonstrate the effectiveness of EasyGen, whose\ntraining can be easily achieved in a lab setting. 
The source code is available\nat https://github.com/zxy556677/EasyGen.\n","authors":["Xiangyu Zhao","Bo Liu","Qijiong Liu","Guangyuan Shi","Xiao-Ming Wu"],"pdf_url":"https://arxiv.org/pdf/2310.08949v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08948v1","updated":"2023-10-13T08:35:02Z","published":"2023-10-13T08:35:02Z","title":"Federated Class-Incremental Learning with Prompting","summary":" As Web technology continues to develop, it has become increasingly common to\nuse data stored on different clients. At the same time, federated learning has\nreceived widespread attention due to its ability to protect data privacy while\nletting models learn from data which is distributed across various clients.\nHowever, most existing works assume that the client's data are fixed. In\nreal-world scenarios, such an assumption is most likely not true as data may be\ncontinuously generated and new classes may also appear. To this end, we focus\non the practical and challenging federated class-incremental learning (FCIL)\nproblem. For FCIL, the local and global models may suffer from catastrophic\nforgetting on old classes caused by the arrival of new classes, and the data\ndistributions of clients are non-independent and identically distributed\n(non-iid).\n In this paper, we propose a novel method called Federated Class-Incremental\nLearning with PrompTing (FCILPT). Given the privacy and limited memory, FCILPT\ndoes not use a rehearsal-based buffer to keep exemplars of old data. We choose\nto use prompts to ease the catastrophic forgetting of the old classes.\nSpecifically, we encode the task-relevant and task-irrelevant knowledge into\nprompts, preserving the old and new knowledge of the local clients and solving\nthe problem of catastrophic forgetting. We first sort the task information in\nthe prompt pool in the local clients to align the task information on different\nclients before global aggregation. It ensures that the same task's knowledge\nis fully integrated, solving the problem of non-iid caused by the lack of\nclasses among different clients in the same incremental task. Experiments on\nCIFAR-100, Mini-ImageNet, and Tiny-ImageNet demonstrate that FCILPT achieves\nsignificant accuracy improvements over the state-of-the-art methods.\n","authors":["Jiale Liu","Yu-Wei Zhan","Chong-Yu Zhang","Xin Luo","Zhen-Duo Chen","Yinwei Wei","Xin-Shun Xu"],"pdf_url":"https://arxiv.org/pdf/2310.08948v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2112.09726v3","updated":"2023-10-13T08:10:41Z","published":"2021-12-17T19:22:01Z","title":"Soundify: Matching Sound Effects to Video","summary":" In the art of video editing, sound helps add character to an object and\nimmerse the viewer within a space. Through formative interviews with\nprofessional editors (N=10), we found that the task of adding sounds to video\ncan be challenging. This paper presents Soundify, a system that assists editors\nin matching sounds to video. Given a video, Soundify identifies matching\nsounds, synchronizes the sounds to the video, and dynamically adjusts panning\nand volume to create spatial audio. In a human evaluation study (N=889), we\nshow that Soundify is capable of matching sounds to video out-of-the-box for a\ndiverse range of audio categories. 
In a within-subjects expert study (N=12), we\ndemonstrate the usefulness of Soundify in helping video editors match sounds to\nvideo with lighter workload, reduced task completion time, and improved\nusability.\n","authors":["David Chuan-En Lin","Anastasis Germanidis","Cristóbal Valenzuela","Yining Shi","Nikolas Martelaro"],"pdf_url":"https://arxiv.org/pdf/2112.09726v3.pdf","comment":"Full paper in UIST 2023; Short paper in NeurIPS 2021 ML4CD Workshop;\n Online demo: http://soundify.cc"},{"id":"http://arxiv.org/abs/2310.08934v1","updated":"2023-10-13T08:00:33Z","published":"2023-10-13T08:00:33Z","title":"Online Adaptive Disparity Estimation for Dynamic Scenes in Structured\n Light Systems","summary":" In recent years, deep neural networks have shown remarkable progress in dense\ndisparity estimation from dynamic scenes in monocular structured light systems.\nHowever, their performance significantly drops when applied in unseen\nenvironments. To address this issue, self-supervised online adaptation has been\nproposed as a solution to bridge this performance gap. Unlike traditional\nfine-tuning processes, online adaptation performs test-time optimization to\nadapt networks to new domains. Therefore, achieving fast convergence during the\nadaptation process is critical for attaining satisfactory accuracy. In this\npaper, we propose an unsupervised loss function based on long sequential\ninputs. It ensures better gradient directions and faster convergence. Our loss\nfunction is designed using a multi-frame pattern flow, which comprises a set of\nsparse trajectories of the projected pattern along the sequence. We estimate\nthe sparse pseudo ground truth with a confidence mask using a filter-based\nmethod, which guides the online adaptation process. Our proposed framework\nsignificantly improves the online adaptation speed and achieves superior\nperformance on unseen data.\n","authors":["Rukun Qiao","Hiroshi Kawasaki","Hongbin Zha"],"pdf_url":"https://arxiv.org/pdf/2310.08934v1.pdf","comment":"Accepted by the 36th IEEE/RSJ International Conference on Intelligent\n Robots and Systems, 2023"},{"id":"http://arxiv.org/abs/2310.08932v1","updated":"2023-10-13T07:55:33Z","published":"2023-10-13T07:55:33Z","title":"TIDE: Temporally Incremental Disparity Estimation via Pattern Flow in\n Structured Light System","summary":" We introduce the Temporally Incremental Disparity Estimation Network (TIDE-Net),\na learning-based technique for disparity computation in mono-camera structured\nlight systems. In our hardware setting, a static pattern is projected onto a\ndynamic scene and captured by a monocular camera. Different from most former\ndisparity estimation methods that operate in a frame-wise manner, our network\nacquires disparity maps in a temporally incremental way. Specifically, we\nexploit the deformation of projected patterns (named pattern flow) on captured\nimage sequences to model the temporal information. Notably, this newly\nproposed pattern flow formulation reflects the disparity changes along the\nepipolar line, which is a special form of optical flow. Tailored for pattern\nflow, the TIDE-Net, a recurrent architecture, is proposed and implemented. For\neach incoming frame, our model fuses correlation volumes (from the current frame)\nand disparity (from the former frame) warped by pattern flow. From fused features,\nthe final stage of TIDE-Net estimates the residual disparity rather than the\nfull disparity, as conducted by many previous methods. 
Interestingly, this\ndesign brings clear empirical advantages in terms of efficiency and\ngeneralization ability. Using only synthetic data for training, our extensive\nevaluation results (w.r.t. both accuracy and efficiency metrics) show superior\nperformance compared to several SOTA models on unseen real data. The code is available\nat https://github.com/CodePointer/TIDENet.\n","authors":["Rukun Qiao","Hiroshi Kawasaki","Hongbin Zha"],"pdf_url":"https://arxiv.org/pdf/2310.08932v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08929v1","updated":"2023-10-13T07:51:00Z","published":"2023-10-13T07:51:00Z","title":"Towards Interpretable Controllability in Object-Centric Learning","summary":" The binding problem in artificial neural networks is actively explored with\nthe goal of achieving human-level recognition skills through the comprehension\nof the world in terms of symbol-like entities. Especially in the field of\ncomputer vision, object-centric learning (OCL) is extensively researched to\nbetter understand complex scenes by acquiring object representations or slots.\nWhile recent studies in OCL have made strides with complex images or videos,\nthe interpretability and interactivity over object representation remain\nlargely uncharted, still holding promise in the field of OCL. In this paper, we\nintroduce a novel method, Slot Attention with Image Augmentation (SlotAug), to\nexplore the possibility of learning interpretable controllability over slots in\na self-supervised manner by utilizing an image augmentation strategy. We also\ndevise the concept of sustainability in controllable slots by introducing\niterative and reversible controls over slots with two proposed submethods:\nAuxiliary Identity Manipulation and Slot Consistency Loss. Extensive empirical\nstudies and theoretical validation confirm the effectiveness of our approach,\noffering a novel capability for interpretable and sustainable control of object\nrepresentations. Code will be available soon.\n","authors":["Jinwoo Kim","Janghyuk Choi","Jaehyun Kang","Changyeon Lee","Ho-Jin Choi","Seon Joo Kim"],"pdf_url":"https://arxiv.org/pdf/2310.08929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08928v1","updated":"2023-10-13T07:50:37Z","published":"2023-10-13T07:50:37Z","title":"SIDE: Self-supervised Intermediate Domain Exploration for Source-free\n Domain Adaptation","summary":" Domain adaptation aims to alleviate the domain shift when transferring the\nknowledge learned from the source domain to the target domain. Due to privacy\nissues, source-free domain adaptation (SFDA), where source data is unavailable\nduring adaptation, has recently become very demanding yet challenging. Existing\nSFDA methods focus on either self-supervised learning of target samples or\nreconstruction of virtual source data. The former overlooks the transferable\nknowledge in the source model, whilst the latter introduces even more\nuncertainty. To address the above issues, this paper proposes self-supervised\nintermediate domain exploration (SIDE) that effectively bridges the domain gap\nwith an intermediate domain, where samples are cyclically filtered out in a\nself-supervised fashion. First, we propose cycle intermediate domain filtering\n(CIDF) to cyclically select intermediate samples with similar distributions\nover source and target domains. Second, with the aid of those intermediate\nsamples, an inter-domain gap transition (IDGT) module is developed to mitigate\npossible distribution mismatches between the source and target data. 
Finally,\nwe introduce cross-view consistency learning (CVCL) to maintain the intrinsic\nclass discriminability whilst adapting the model to the target domain.\nExtensive experiments on three popular benchmarks, i.e., Office-31, Office-Home\nand VisDA-C, show that our proposed SIDE achieves competitive performance\nagainst state-of-the-art methods.\n","authors":["Jiamei Liu","Han Sun","Yizhen Jia","Jie Qin","Huiyu Zhou","Ningzhong Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08928v1.pdf","comment":"code at https://github.com/se111/SIDE"},{"id":"http://arxiv.org/abs/2201.03454v3","updated":"2023-10-13T07:48:24Z","published":"2022-01-10T16:53:39Z","title":"3D Face Morphing Attacks: Generation, Vulnerability and Detection","summary":" Face Recognition systems (FRS) have been found to be vulnerable to morphing\nattacks, where the morphed face image is generated by blending the face images\nfrom contributory data subjects. This work presents a novel direction for\ngenerating face-morphing attacks in 3D. To this end, we introduced a novel\napproach based on blending 3D face point clouds corresponding to contributory\ndata subjects. The proposed method generates 3D face morphing by projecting the\ninput 3D face point clouds onto depth maps and 2D color images, followed by\nimage blending and wrapping operations performed independently on the color\nimages and depth maps. We then back-projected the 2D morphing color map and the\ndepth map to the point cloud using the canonical (fixed) view. Given that the\ngenerated 3D face morphing models will result in holes owing to a single\ncanonical view, we have proposed a new algorithm for hole filling that will\nresult in a high-quality 3D face morphing model. Extensive experiments were\nconducted on the newly generated 3D face dataset comprising 675 3D scans\ncorresponding to 41 unique data subjects and a publicly available database\n(Facescape) with 100 data subjects. Experiments were performed to benchmark the\nvulnerability of the proposed 3D morph-generation scheme against automatic\n2D and 3D FRS, and human observer analysis. We also presented a quantitative\nassessment of the quality of the generated 3D face-morphing models using eight\ndifferent quality metrics. Finally, we propose three different 3D face Morphing\nAttack Detection (3D-MAD) algorithms to benchmark the performance of 3D face\nmorphing attack detection techniques.\n","authors":["Jag Mohan Singh","Raghavendra Ramachandra"],"pdf_url":"https://arxiv.org/pdf/2201.03454v3.pdf","comment":"The paper is accepted at IEEE Transactions on Biometrics, Behavior\n and Identity Science"},{"id":"http://arxiv.org/abs/2310.08921v1","updated":"2023-10-13T07:46:57Z","published":"2023-10-13T07:46:57Z","title":"Feature Proliferation -- the \"Cancer\" in StyleGAN and its Treatments","summary":" Despite the success of StyleGAN in image synthesis, the images it synthesizes\nare not always perfect and the well-known truncation trick has become a\nstandard post-processing technique for StyleGAN to synthesize high-quality\nimages. Although effective, it has long been noted that the truncation trick\ntends to reduce the diversity of synthesized images and unnecessarily\nsacrifices many distinct image features. To address this issue, in this paper,\nwe first delve into the StyleGAN image synthesis mechanism and discover an\nimportant phenomenon, namely Feature Proliferation, which demonstrates how\nspecific features reproduce with forward propagation. 
Then, we show how the\noccurrence of Feature Proliferation results in StyleGAN image artifacts. As an\nanalogy, we refer to it as the \"cancer\" in StyleGAN from its proliferating and\nmalignant nature. Finally, we propose a novel feature rescaling method that\nidentifies and modulates risky features to mitigate feature proliferation.\nThanks to our discovery of Feature Proliferation, the proposed feature\nrescaling method is less destructive and retains more useful image features\nthan the truncation trick, as it is more fine-grained and works in a\nlower-level feature space rather than a high-level latent space. Experimental\nresults justify the validity of our claims and the effectiveness of the\nproposed feature rescaling method. Our code is available at\nhttps://github.com/songc42/Feature-proliferation.\n","authors":["Shuang Song","Yuanbang Liang","Jing Wu","Yu-Kun Lai","Yipeng Qin"],"pdf_url":"https://arxiv.org/pdf/2310.08921v1.pdf","comment":"Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2303.11793v2","updated":"2023-10-13T07:37:56Z","published":"2023-03-21T12:22:59Z","title":"OTJR: Optimal Transport Meets Optimal Jacobian Regularization for\n Adversarial Robustness","summary":" The Web, as a rich medium of diverse content, has been constantly under the\nthreat of malicious entities exploiting its vulnerabilities, especially with\nthe rapid proliferation of deep learning applications in various web services.\nOne such vulnerability, crucial to the fidelity and integrity of web content,\nis the susceptibility of deep neural networks to adversarial perturbations,\nespecially concerning images - a dominant form of data on the web. In light of\nthe recent advancements in the robustness of classifiers, we delve deep into\nthe intricacies of adversarial training (AT) and Jacobian regularization, two\npivotal defenses. Our work is the first to carefully analyze and characterize\nthese two schools of approaches, both theoretically and empirically, to\ndemonstrate how each approach impacts the robust learning of a classifier.\nNext, we propose our novel Optimal Transport with Jacobian regularization\nmethod, dubbed OTJR, jointly incorporating the input-output Jacobian\nregularization into the AT by leveraging optimal transport theory. In\nparticular, we employ the Sliced Wasserstein (SW) distance that can efficiently\npush the adversarial samples' representations closer to those of clean samples,\nregardless of the number of classes within the dataset. The SW distance\nprovides the adversarial samples' movement directions, which are much more\ninformative and powerful for the Jacobian regularization. Our empirical\nevaluations set a new standard in the domain, with our method achieving\ncommendable accuracies of 51.41% on the CIFAR-10 and 28.49% on the\nCIFAR-100 datasets under the AutoAttack metric. In a real-world\ndemonstration, we subject images sourced from the Internet to online\nadversarial attacks, reinforcing the efficacy and relevance of our model in\ndefending against sophisticated web-image perturbations.\n","authors":["Binh M. Le","Shahroz Tariq","Simon S. 
Woo"],"pdf_url":"https://arxiv.org/pdf/2303.11793v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08910v1","updated":"2023-10-13T07:31:04Z","published":"2023-10-13T07:31:04Z","title":"Scalarization for Multi-Task and Multi-Domain Learning at Scale","summary":" Training a single model on multiple input domains and/or output tasks allows\nfor compressing information from multiple sources into a unified backbone, hence\nimproving model efficiency. It also enables potential positive knowledge\ntransfer across tasks/domains, leading to improved accuracy and data-efficient\ntraining. However, optimizing such networks is a challenge, in particular due\nto discrepancies between the different tasks or domains: Despite several\nhypotheses and solutions proposed over the years, recent work has shown that\nuniform scalarization training, i.e., simply minimizing the average of the task\nlosses, yields on-par performance with more costly SotA optimization methods.\nThis raises the issue of how well we understand the training dynamics of\nmulti-task and multi-domain networks. In this work, we first devise a\nlarge-scale unified analysis of multi-domain and multi-task learning to better\nunderstand the dynamics of scalarization across varied task/domain combinations\nand model sizes. Following these insights, we then propose to leverage\npopulation-based training to efficiently search for the optimal scalarization\nweights when dealing with a large number of tasks or domains.\n","authors":["Amelie Royer","Tijmen Blankevoort","Babak Ehteshami Bejnordi"],"pdf_url":"https://arxiv.org/pdf/2310.08910v1.pdf","comment":"NeurIPS 2023; https://openreview.net/forum?id=TSuq3debnD"},{"id":"http://arxiv.org/abs/2310.08904v1","updated":"2023-10-13T07:19:30Z","published":"2023-10-13T07:19:30Z","title":"3D Understanding of Deformable Linear Objects: Datasets and\n Transferability Benchmark","summary":" Deformable linear objects are vastly represented in our everyday lives. It is\noften challenging even for humans to visually understand them, as the same\nobject can be entangled so that it appears completely different. Examples of\ndeformable linear objects include blood vessels and wiring harnesses, vital to\nthe functioning of their corresponding systems, such as the human body and a\nvehicle. However, no point cloud datasets exist for studying 3D deformable\nlinear objects. Therefore, we are introducing two point cloud datasets,\nPointWire and PointVessel. We evaluated state-of-the-art methods on the\nproposed large-scale 3D deformable linear object benchmarks. Finally, we\nanalyzed the generalization capabilities of these methods by conducting\ntransferability experiments on the PointWire and PointVessel datasets.\n","authors":["Bare Luka Žagar","Tim Hertel","Mingyu Liu","Ekim Yurtsever","Alois C. 
Knoll"],"pdf_url":"https://arxiv.org/pdf/2310.08904v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08247v3","updated":"2023-10-13T07:14:24Z","published":"2023-06-14T05:25:06Z","title":"Diffusion in Diffusion: Cyclic One-Way Diffusion for\n Text-Vision-Conditioned Generation","summary":" Originating from the diffusion phenomenon in physics that describes particle\nmovement, the diffusion generative models inherit the characteristics of\nstochastic random walk in the data space along the denoising trajectory.\nHowever, the intrinsic mutual interference among image regions contradicts the\nneed for practical downstream application scenarios where the preservation of\nlow-level pixel information from given conditioning is desired (e.g.,\ncustomization tasks like personalized generation and inpainting based on a\nuser-provided single image). In this work, we investigate the diffusion\n(physics) in diffusion (machine learning) properties and propose our Cyclic\nOne-Way Diffusion (COW) method to control the direction of diffusion phenomenon\ngiven a pre-trained frozen diffusion model for versatile customization\napplication scenarios, where the low-level pixel information from the\nconditioning needs to be preserved. Notably, unlike most current methods that\nincorporate additional conditions by fine-tuning the base text-to-image\ndiffusion model or learning auxiliary networks, our method provides a novel\nperspective to understand the task needs and is applicable to a wider range of\ncustomization scenarios in a learning-free manner. Extensive experiment results\nshow that our proposed COW can achieve more flexible customization based on\nstrict visual conditions in different application settings.\n","authors":["Ruoyu Wang","Yongqi Yang","Zhihao Qian","Ye Zhu","Yu Wu"],"pdf_url":"https://arxiv.org/pdf/2306.08247v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2202.03574v4","updated":"2023-10-13T07:13:51Z","published":"2022-02-04T12:30:49Z","title":"Structured Prediction Problem Archive","summary":" Structured prediction problems are one of the fundamental tools in machine\nlearning. In order to facilitate algorithm development for their numerical\nsolution, we collect in one place a large number of datasets in easy to read\nformats for a diverse set of problem classes. We provide archival links to\ndatasets, description of the considered problems and problem formats, and a\nshort summary of problem characteristics including size, number of instances\netc. For reference we also give a non-exhaustive selection of algorithms\nproposed in the literature for their solution. We hope that this central\nrepository will make benchmarking and comparison to established works easier.\nWe welcome submission of interesting new datasets and algorithms for inclusion\nin our archive.\n","authors":["Paul Swoboda","Ahmed Abbas","Florian Bernard","Andrea Hornakova","Paul Roetzer","Bogdan Savchynskyy"],"pdf_url":"https://arxiv.org/pdf/2202.03574v4.pdf","comment":"Added new shape matching instances based of learned descriptors"},{"id":"http://arxiv.org/abs/2305.15710v2","updated":"2023-10-13T07:04:13Z","published":"2023-05-25T04:44:50Z","title":"CUEING: a lightweight model to Capture hUman attEntion In driviNG","summary":" Discrepancies in decision-making between Autonomous Driving Systems (ADS) and\nhuman drivers underscore the need for intuitive human gaze predictors to bridge\nthis gap, thereby improving user trust and experience. 
Existing gaze datasets,\ndespite their value, suffer from noise that hampers effective training.\nFurthermore, current gaze prediction models exhibit inconsistency across\ndiverse scenarios and demand substantial computational resources, restricting\ntheir on-board deployment in autonomous vehicles. We propose a novel adaptive\ncleansing technique for purging noise from existing gaze datasets, coupled with\na robust, lightweight convolutional self-attention gaze prediction model. Our\napproach not only significantly enhances model generalizability and performance\nby up to 12.13% but also ensures a remarkable reduction in model complexity by\nup to 98.2% compared to the state-of-the-art, making in-vehicle deployment\nfeasible to augment ADS decision visualization and performance.\n","authors":["Linfeng Liang","Yao Deng","Yang Zhang","Jianchao Lu","Chen Wang","Quanzheng Sheng","Xi Zheng"],"pdf_url":"https://arxiv.org/pdf/2305.15710v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08897v1","updated":"2023-10-13T06:58:52Z","published":"2023-10-13T06:58:52Z","title":"Self supervised convolutional kernel based handcrafted feature\n harmonization: Enhanced left ventricle hypertension disease phenotyping on\n echocardiography","summary":" Radiomics, a medical imaging technique, extracts quantitative handcrafted\nfeatures from images to predict diseases. Harmonization of those features\nensures consistent feature extraction across various imaging devices and\nprotocols. Methods for harmonization include standardized imaging protocols,\nstatistical adjustments, and evaluating feature robustness. Myocardial diseases\nsuch as Left Ventricular Hypertrophy (LVH) and Hypertensive Heart Disease (HHD)\nare diagnosed via echocardiography, but variable imaging settings pose\nchallenges. Harmonization techniques are crucial for applying handcrafted\nfeatures in disease diagnosis in such scenarios. Self-supervised learning (SSL)\nenhances data understanding within limited datasets and adapts to diverse data\nsettings. ConvNeXt-V2 integrates convolutional layers into SSL, displaying\nsuperior performance in various tasks. This study focuses on convolutional\nfilters within SSL, using them as preprocessing to convert images into feature\nmaps for handcrafted feature harmonization. Our proposed method excelled in\nharmonization evaluation and exhibited superior LVH classification performance\ncompared to existing methods.\n","authors":["Jina Lee","Youngtaek Hong","Dawun Jeong","Yeonggul Jang","Sihyeon Jeong","Taekgeun Jung","Yeonyee E. Yoon","Inki Moon","Seung-Ah Lee","Hyuk-Jae Chang"],"pdf_url":"https://arxiv.org/pdf/2310.08897v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08892v1","updated":"2023-10-13T06:53:28Z","published":"2023-10-13T06:53:28Z","title":"Image Cropping under Design Constraints","summary":" Image cropping is essential in image editing for obtaining a compositionally\nenhanced image. In display media, image cropping is a prospective technique for\nautomatically creating media content. However, image cropping for media\ncontents is often required to satisfy various constraints, such as an aspect\nratio and blank regions for placing texts or objects. We call this problem\nimage cropping under design constraints. To achieve image cropping under design\nconstraints, we propose a score function-based approach, which computes scores\nthat indicate whether cropped results are aesthetically plausible and satisfy the design\nconstraints. 
We explore two derived approaches, a proposal-based approach and\na heatmap-based approach, and we construct a dataset for evaluating the\nperformance of the proposed approaches on image cropping under design\nconstraints. In experiments, we demonstrate that the proposed approaches\noutperform a baseline, and we observe that the proposal-based approach is\nbetter than the heatmap-based approach under the same computation cost, but the\nheatmap-based approach leads to better scores by increasing computation cost.\nThe experimental results indicate that balancing aesthetically plausible\nregions and satisfying design constraints is not a trivial problem and requires\na sensitive balance, and both proposed approaches are reasonable alternatives.\n","authors":["Takumi Nishiyasu","Wataru Shimoda","Yoichi Sato"],"pdf_url":"https://arxiv.org/pdf/2310.08892v1.pdf","comment":"Accepted at ACMMM Asia"},{"id":"http://arxiv.org/abs/2310.08888v1","updated":"2023-10-13T06:48:38Z","published":"2023-10-13T06:48:38Z","title":"A Hybrid Transfer Learning Assisted Decision Support System for Accurate\n Prediction of Alzheimer Disease","summary":" Alzheimer's disease (AD) is the most common long-term illness in elderly\npeople. In recent years, deep learning has become popular in the area of\nmedical imaging and has had a lot of success there. It has become the most\neffective way to look at medical images. When it comes to detecting AD, the\ndeep neural model is more accurate and effective than general machine learning.\nOur research contributes to the development of a more comprehensive\nunderstanding and detection of the disease by identifying four distinct classes\nthat are predictive of AD with a high weighted accuracy of 98.91%. A unique\nstrategy has been proposed in this study to improve the accuracy of the imbalanced dataset\nclassification problem via the combination of ensemble averaging models and\nfive different transfer learning models.\nEfficientNetB0+Resnet152 (effnet+res152) and\nInceptionV3+EfficientNetB0+Resnet50 (incep+effnet+res50) models have been\nfine-tuned and have reached the highest weighted accuracy for multi-class AD\nstage classifications.\n","authors":["Mahin Khan Mahadi","Abdullah Abdullah","Jamal Uddin","Asif Newaz"],"pdf_url":"https://arxiv.org/pdf/2310.08888v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.03434v7","updated":"2023-10-13T06:46:27Z","published":"2022-12-07T03:39:18Z","title":"Name Your Colour For the Task: Artificially Discover Colour Naming via\n Colour Quantisation Transformer","summary":" The long-standing theory that a colour-naming system evolves under dual\npressure of efficient communication and perceptual mechanism is supported by\nmore and more linguistic studies, including analysing four decades of\ndiachronic data from the Nafaanra language. This inspires us to explore whether\nmachine learning could evolve and discover a similar colour-naming system via\noptimising the communication efficiency represented by high-level recognition\nperformance. Here, we propose a novel colour quantisation transformer,\nCQFormer, that quantises colour space while maintaining the accuracy of machine\nrecognition on the quantised images. Given an RGB image, Annotation Branch maps\nit into an index map before generating the quantised image with a colour\npalette; meanwhile the Palette Branch utilises a key-point detection way to\nfind proper colours in the palette among the whole colour space. 
By interacting\nwith colour annotation, CQFormer is able to balance both the machine vision\naccuracy and colour perceptual structure such as distinct and stable colour\ndistribution for discovered colour system. Very interestingly, we even observe\nthe consistent evolution pattern between our artificial colour system and basic\ncolour terms across human languages. Besides, our colour quantisation method\nalso offers an efficient quantisation method that effectively compresses the\nimage storage while maintaining high performance in high-level recognition\ntasks such as classification and detection. Extensive experiments demonstrate\nthe superior performance of our method with extremely low bit-rate colours,\nshowing potential to integrate into quantisation network to quantities from\nimage to network activation. The source code is available at\nhttps://github.com/ryeocthiv/CQFormer\n","authors":["Shenghan Su","Lin Gu","Yue Yang","Zenghui Zhang","Tatsuya Harada"],"pdf_url":"https://arxiv.org/pdf/2212.03434v7.pdf","comment":"ICCV 2023 Oral"},{"id":"http://arxiv.org/abs/2310.08884v1","updated":"2023-10-13T06:34:23Z","published":"2023-10-13T06:34:23Z","title":"Extending Multi-modal Contrastive Representations","summary":" Multi-modal contrastive representation (MCR) of more than three modalities is\ncritical in multi-modal learning. Although recent methods showcase impressive\nachievements, the high dependence on large-scale, high-quality paired data and\nthe expensive training costs limit their further development. Inspired by\nrecent C-MCR, this paper proposes Extending Multimodal Contrastive\nRepresentation (Ex-MCR), a training-efficient and paired-data-free method to\nflexibly learn unified contrastive representation space for more than three\nmodalities by integrating the knowledge of existing MCR spaces. Specifically,\nEx-MCR aligns multiple existing MCRs into the same based MCR, which can\neffectively preserve the original semantic alignment of the based MCR. Besides,\nwe comprehensively enhance the entire learning pipeline for aligning MCR spaces\nfrom the perspectives of training data, architecture, and learning objectives.\nWith the preserved original modality alignment and the enhanced space\nalignment, Ex-MCR shows superior representation learning performance and\nexcellent modality extensibility. To demonstrate the effectiveness of Ex-MCR,\nwe align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into the CLIP\n(vision-text), leveraging the overlapping text and image modality,\nrespectively. Remarkably, without using any paired data, Ex-MCR learns a\n3D-image-text-audio unified contrastive representation, and it achieves\nstate-of-the-art performance on audio-visual, 3D-image, audio-text, visual-text\nretrieval, and 3D object classification tasks. 
More importantly, extensive\nqualitative results further demonstrate the emergent semantic alignment between\nthe extended modalities (e.g., audio and 3D), which highlights the great\npotential of modality extensibility.\n","authors":["Zehan Wang","Ziang Zhang","Luping Liu","Yang Zhao","Haifeng Huang","Tao Jin","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08884v1.pdf","comment":"Our code is available at https://github.com/MCR-PEFT/Ex-MCR"},{"id":"http://arxiv.org/abs/2310.08872v1","updated":"2023-10-13T05:48:42Z","published":"2023-10-13T05:48:42Z","title":"R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image\n Generation","summary":" Recent text-to-image (T2I) diffusion models have achieved remarkable progress\nin generating high-quality images given text-prompts as input. However, these\nmodels fail to convey appropriate spatial composition specified by a layout\ninstruction. In this work, we probe into zero-shot grounded T2I generation with\ndiffusion models, that is, generating images corresponding to the input layout\ninformation without training auxiliary modules or finetuning diffusion models.\nWe propose a Region and Boundary (R&B) aware cross-attention guidance approach\nthat gradually modulates the attention maps of diffusion model during\ngenerative process, and assists the model to synthesize images (1) with high\nfidelity, (2) highly compatible with textual input, and (3) interpreting layout\ninstructions accurately. Specifically, we leverage the discrete sampling to\nbridge the gap between consecutive attention maps and discrete layout\nconstraints, and design a region-aware loss to refine the generative layout\nduring diffusion process. We further propose a boundary-aware loss to\nstrengthen object discriminability within the corresponding regions.\nExperimental results show that our method outperforms existing state-of-the-art\nzero-shot grounded T2I generation methods by a large margin both qualitatively\nand quantitatively on several benchmarks.\n","authors":["Jiayu Xiao","Liang Li","Henglei Lv","Shuhui Wang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08872v1.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2310.03986v2","updated":"2023-10-13T05:35:40Z","published":"2023-10-06T03:04:21Z","title":"Robust Multimodal Learning with Missing Modalities via\n Parameter-Efficient Adaptation","summary":" Multimodal learning seeks to utilize data from multiple sources to improve\nthe overall performance of downstream tasks. It is desirable for redundancies\nin the data to make multimodal systems robust to missing or corrupted\nobservations in some correlated modalities. However, we observe that the\nperformance of several existing multimodal networks significantly deteriorates\nif one or multiple modalities are absent at test time. To enable robustness to\nmissing modalities, we propose simple and parameter-efficient adaptation\nprocedures for pretrained multimodal networks. In particular, we exploit\nlow-rank adaptation and modulation of intermediate features to compensate for\nthe missing modalities. We demonstrate that such adaptation can partially\nbridge performance drop due to missing modalities and outperform independent,\ndedicated networks trained for the available modality combinations in some\ncases. The proposed adaptation requires extremely small number of parameters\n(e.g., fewer than 0.7% of the total parameters in most experiments). 
We conduct\na series of experiments to highlight the robustness of our proposed method\nusing diverse datasets for RGB-thermal and RGB-Depth semantic segmentation,\nmultimodal material segmentation, and multimodal sentiment analysis tasks. Our\nproposed method demonstrates versatility across various tasks and datasets, and\noutperforms existing methods for robust multimodal learning with missing\nmodalities.\n","authors":["Md Kaykobad Reza","Ashley Prater-Bennette","M. Salman Asif"],"pdf_url":"https://arxiv.org/pdf/2310.03986v2.pdf","comment":"18 pages, 3 figures, 11 tables"},{"id":"http://arxiv.org/abs/2310.08861v1","updated":"2023-10-13T05:08:35Z","published":"2023-10-13T05:08:35Z","title":"Re-initialization-free Level Set Method via Molecular Beam Epitaxy\n Equation Regularization for Image Segmentation","summary":" The variational level set method has become a powerful tool in image segmentation\ndue to its ability to handle complex topological changes and maintain\ncontinuity and smoothness in the process of evolution. However, its evolution\nprocess can be unstable, which results in over-flattened or over-sharpened\ncontours and segmentation failure. To improve the accuracy and stability of\nevolution, we propose a high-order level set variational segmentation method\nintegrated with molecular beam epitaxy (MBE) equation regularization. This\nmethod uses the crystal growth in the MBE process to limit the evolution of the\nlevel set function, and thus can avoid the re-initialization in the evolution\nprocess and regulate the smoothness of the segmented curve. It also works for\nnoisy images with intensity inhomogeneity, which is a challenge in image\nsegmentation. To solve the variational model, we derive the gradient flow and\ndesign a scalar auxiliary variable (SAV) scheme coupled with the fast Fourier\ntransform (FFT), which can significantly improve the computational efficiency\ncompared with the traditional semi-implicit and semi-explicit schemes. Numerical\nexperiments show that the proposed method can generate smooth segmentation\ncurves, retain fine segmentation targets and obtain robust segmentation results\nof small objects. Compared to existing level set methods, this model is\nstate-of-the-art in both accuracy and efficiency.\n","authors":["Fanghui Song","Jiebao Sun","Shengzhu Shi","Zhichang Guo","Dazhi Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08861v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02601v3","updated":"2023-10-13T05:04:35Z","published":"2023-10-04T06:14:06Z","title":"MagicDrive: Street View Generation with Diverse 3D Geometry Control","summary":" Recent advancements in diffusion models have significantly enhanced data\nsynthesis with 2D control. Yet, precise 3D control in street view generation,\ncrucial for 3D perception tasks, remains elusive. Specifically, utilizing\nBird's-Eye View (BEV) as the primary condition often leads to challenges in\ngeometry control (e.g., height), affecting the representation of object shapes,\nocclusion patterns, and road surface elevations, all of which are essential to\nperception data synthesis, especially for 3D object detection tasks. In this\npaper, we introduce MagicDrive, a novel street view generation framework\noffering diverse 3D geometry controls, including camera poses, road maps, and\n3D bounding boxes, together with textual descriptions, achieved through\ntailored encoding strategies. Besides, our design incorporates a cross-view\nattention module, ensuring consistency across multiple camera views. 
With\nMagicDrive, we achieve high-fidelity street-view synthesis that captures\nnuanced 3D geometry and various scene descriptions, enhancing tasks like BEV\nsegmentation and 3D object detection.\n","authors":["Ruiyuan Gao","Kai Chen","Enze Xie","Lanqing Hong","Zhenguo Li","Dit-Yan Yeung","Qiang Xu"],"pdf_url":"https://arxiv.org/pdf/2310.02601v3.pdf","comment":"Project Page: https://flymin.github.io/magicdrive"},{"id":"http://arxiv.org/abs/2310.08854v1","updated":"2023-10-13T04:48:32Z","published":"2023-10-13T04:48:32Z","title":"Rank-DETR for High Quality Object Detection","summary":" Modern detection transformers (DETRs) use a set of object queries to predict\na list of bounding boxes, sort them by their classification confidence scores,\nand select the top-ranked predictions as the final detection results for the\ngiven input image. A highly performant object detector requires accurate\nranking for the bounding box predictions. For DETR-based detectors, the\ntop-ranked bounding boxes suffer from less accurate localization quality due to\nthe misalignment between classification scores and localization accuracy, thus\nimpeding the construction of high-quality detectors. In this work, we introduce\na simple and highly performant DETR-based object detector by proposing a series\nof rank-oriented designs, collectively called Rank-DETR. Our key contributions\ninclude: (i) a rank-oriented architecture design that can prompt positive\npredictions and suppress the negative ones to ensure lower false positive\nrates, as well as (ii) a rank-oriented loss function and matching cost design\nthat prioritizes predictions with more accurate localization during\nranking to boost the AP under high IoU thresholds. We apply our method to\nimprove the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong\nCOCO object detection results when using different backbones such as\nResNet-50, Swin-T, and Swin-L, demonstrating the effectiveness of our\napproach. Code is available at https://github.com/LeapLabTHU/Rank-DETR.\n","authors":["Yifan Pu","Weicong Liang","Yiduo Hao","Yuhui Yuan","Yukang Yang","Chao Zhang","Han Hu","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08854v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.04780v3","updated":"2023-10-13T04:23:26Z","published":"2023-10-07T11:45:33Z","title":"IPMix: Label-Preserving Data Augmentation Method for Training Robust\n Classifiers","summary":" Data augmentation has been proven effective for training high-accuracy\nconvolutional neural network classifiers by preventing overfitting. However,\nbuilding deep neural networks in real-world scenarios requires not only high\naccuracy on clean data but also robustness when data distributions shift. While\nprior methods have proposed that there is a trade-off between accuracy and\nrobustness, we propose IPMix, a simple data augmentation approach to improve\nrobustness without hurting clean accuracy. IPMix integrates three levels of\ndata augmentation (image-level, patch-level, and pixel-level) into a coherent\nand label-preserving technique to increase the diversity of training data with\nlimited computational overhead. To further improve the robustness, IPMix\nintroduces structural complexity at different levels to generate more diverse\nimages and adopts the random mixing method for multi-scale information fusion.\nExperiments demonstrate that IPMix outperforms state-of-the-art methods in corruption\nrobustness on CIFAR-C and ImageNet-C. 
In addition, we show that IPMix also\nsignificantly improves the other safety measures, including robustness to\nadversarial perturbations, calibration, prediction consistency, and anomaly\ndetection, achieving state-of-the-art or comparable results on several\nbenchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.\n","authors":["Zhenglin Huang","Xianan Bao","Na Zhang","Qingqi Zhang","Xiaomei Tu","Biao Wu","Xi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.04780v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.12403v2","updated":"2023-10-13T03:48:11Z","published":"2022-06-24T17:59:02Z","title":"ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings","summary":" We present a scalable approach for learning open-world object-goal navigation\n(ObjectNav) -- the task of asking a virtual robot (agent) to find any instance\nof an object in an unexplored environment (e.g., \"find a sink\"). Our approach\nis entirely zero-shot -- i.e., it does not require ObjectNav rewards or\ndemonstrations of any kind. Instead, we train on the image-goal navigation\n(ImageNav) task, in which agents find the location where a picture (i.e., goal\nimage) was captured. Specifically, we encode goal images into a multimodal,\nsemantic embedding space to enable training semantic-goal navigation\n(SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D).\nAfter training, SemanticNav agents can be instructed to find objects described\nin free-form natural language (e.g., \"sink\", \"bathroom sink\", etc.) by\nprojecting language goals into the same multimodal, semantic embedding space.\nAs a result, our approach enables open-world ObjectNav. We extensively evaluate\nour agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe\nabsolute improvements in success of 4.2% - 20.0% over existing zero-shot\nmethods. For reference, these gains are similar or better than the 5%\nimprovement in success between the Habitat 2020 and 2021 ObjectNav challenge\nwinners. In an open-world setting, we discover that our agents can generalize\nto compound instructions with a room explicitly mentioned (e.g., \"Find a\nkitchen sink\") and when the target room can be inferred (e.g., \"Find a sink and\na stove\").\n","authors":["Arjun Majumdar","Gunjan Aggarwal","Bhavika Devnani","Judy Hoffman","Dhruv Batra"],"pdf_url":"https://arxiv.org/pdf/2206.12403v2.pdf","comment":"code: https://github.com/gunagg/zson"},{"id":"http://arxiv.org/abs/2304.12152v2","updated":"2023-10-13T03:40:42Z","published":"2023-04-24T15:03:37Z","title":"Efficient Halftoning via Deep Reinforcement Learning","summary":" Halftoning aims to reproduce a continuous-tone image with pixels whose\nintensities are constrained to two discrete levels. This technique has been\ndeployed on every printer, and the majority of them adopt fast methods (e.g.,\nordered dithering, error diffusion) that fail to render structural details,\nwhich determine halftone's quality. Other prior methods of pursuing visual\npleasure by searching for the optimal halftone solution, on the contrary,\nsuffer from their high computational cost. In this paper, we propose a fast and\nstructure-aware halftoning method via a data-driven approach. Specifically, we\nformulate halftoning as a reinforcement learning problem, in which each binary\npixel's value is regarded as an action chosen by a virtual agent with a shared\nfully convolutional neural network (CNN) policy. 
In the offline phase, an\neffective gradient estimator is utilized to train the agents in producing\nhigh-quality halftones in one action step. Then, halftones can be generated\nonline by one fast CNN inference. Besides, we propose a novel anisotropy\nsuppressing loss function, which brings the desirable blue-noise property.\nFinally, we find that optimizing SSIM could result in holes in flat areas,\nwhich can be avoided by weighting the metric with the contone's contrast map.\nExperiments show that our framework can effectively train a light-weight CNN,\nwhich is 15x faster than previous structure-aware methods, to generate\nblue-noise halftones with satisfactory visual quality. We also present a\nprototype of deep multitoning to demonstrate the extensibility of our method.\n","authors":["Haitian Jiang","Dongliang Xiong","Xiaowen Jiang","Li Ding","Liang Chen","Kai Huang"],"pdf_url":"https://arxiv.org/pdf/2304.12152v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03601v2","updated":"2023-10-13T03:25:34Z","published":"2023-07-07T13:43:44Z","title":"GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest","summary":" Visual instruction tuning large language model (LLM) on image-text pairs has\nachieved general-purpose vision-language abilities. However, the lack of\nregion-text pairs limits their advancements to fine-grained multimodal\nunderstanding. In this paper, we propose spatial instruction tuning, which\nintroduces the reference to the region-of-interest (RoI) in the instruction.\nBefore sending to LLM, the reference is replaced by RoI features and\ninterleaved with language embeddings as a sequence. Our model GPT4RoI, trained\non 7 region-text pair datasets, brings an unprecedented interactive and\nconversational experience compared to previous image-level models. (1)\nInteraction beyond language: Users can interact with our model by both language\nand drawing bounding boxes to flexibly adjust the referring granularity. (2)\nVersatile multimodal abilities: A variety of attribute information within each\nRoI can be mined by GPT4RoI, e.g., color, shape, material, action, etc.\nFurthermore, it can reason about multiple RoIs based on common sense. On the\nVisual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable\naccuracy of 81.6%, surpassing all existing models by a significant margin (the\nsecond place is 75.6%) and almost reaching human-level performance of 85.0%.\nThe code, dataset, and demo can be found at\nhttps://github.com/jshilong/GPT4RoI.\n","authors":["Shilong Zhang","Peize Sun","Shoufa Chen","Min Xiao","Wenqi Shao","Wenwei Zhang","Yu Liu","Kai Chen","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2307.03601v2.pdf","comment":"Code has been released at https://github.com/jshilong/GPT4RoI"},{"id":"http://arxiv.org/abs/2310.01404v2","updated":"2023-10-13T03:14:16Z","published":"2023-10-02T17:59:03Z","title":"H-InDex: Visual Reinforcement Learning with Hand-Informed\n Representations for Dexterous Manipulation","summary":" Human hands possess remarkable dexterity and have long served as a source of\ninspiration for robotic manipulation. In this work, we propose a human\nHand-Informed visual representation learning framework to\nsolve difficult Dexterous manipulation tasks (H-InDex)\nwith reinforcement learning. 
Our framework consists of three stages: (i)\npre-training representations with 3D human hand pose estimation, (ii) offline\nadapting representations with self-supervised keypoint detection, and (iii)\nreinforcement learning with exponential moving average BatchNorm. The last two\nstages only modify 0.36% of the pre-trained representation's parameters in\ntotal, ensuring the knowledge from pre-training is maintained to the full\nextent. We empirically study 12 challenging dexterous manipulation tasks and\nfind that H-InDex largely surpasses strong baseline methods and the recent\nvisual foundation models for motor control. Code is available at\nhttps://yanjieze.com/H-InDex.\n","authors":["Yanjie Ze","Yuyao Liu","Ruizhe Shi","Jiaxin Qin","Zhecheng Yuan","Jiashun Wang","Huazhe Xu"],"pdf_url":"https://arxiv.org/pdf/2310.01404v2.pdf","comment":"NeurIPS 2023. Code and videos: https://yanjieze.com/H-InDex"},{"id":"http://arxiv.org/abs/2303.05050v3","updated":"2023-10-13T02:44:40Z","published":"2023-03-09T05:54:42Z","title":"Lifelong-MonoDepth: Lifelong Learning for Multi-Domain Monocular Metric\n Depth Estimation","summary":" With the rapid advancements in autonomous driving and robot navigation, there\nis a growing demand for lifelong learning models capable of estimating metric\n(absolute) depth. Lifelong learning approaches potentially offer significant\ncost savings in terms of model training, data storage, and collection. However,\nthe quality of RGB images and depth maps is sensor-dependent, and depth maps in\nthe real world exhibit domain-specific characteristics, leading to variations\nin depth ranges. These challenges limit existing methods to lifelong learning\nscenarios with small domain gaps and relative depth map estimation. To\nfacilitate lifelong metric depth learning, we identify three crucial technical\nchallenges that require attention: i) developing a model capable of addressing\nthe depth scale variation through scale-aware depth learning, ii) devising an\neffective learning strategy to handle significant domain gaps, and iii)\ncreating an automated solution for domain-aware depth inference in practical\napplications. Based on the aforementioned considerations, in this paper, we\npresent i) a lightweight multi-head framework that effectively tackles the\ndepth scale imbalance, ii) an uncertainty-aware lifelong learning solution that\nadeptly handles significant domain gaps, and iii) an online domain-specific\npredictor selection method for real-time inference. Through extensive numerical\nstudies, we show that the proposed method can achieve good efficiency,\nstability, and plasticity, leading the benchmarks by 8% to 15%.\n","authors":["Junjie Hu","Chenyou Fan","Liguang Zhou","Qing Gao","Honghai Liu","Tin Lun Lam"],"pdf_url":"https://arxiv.org/pdf/2303.05050v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08826v1","updated":"2023-10-13T02:43:35Z","published":"2023-10-13T02:43:35Z","title":"Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous\n Driving","summary":" LiDAR and camera are two critical sensors for multi-modal 3D semantic\nsegmentation and are supposed to be fused efficiently and robustly to promise\nsafety in various real-world scenarios. However, existing multi-modal methods\nface two key challenges: 1) difficulty with efficient deployment and real-time\nexecution; and 2) drastic performance degradation under weak calibration\nbetween LiDAR and cameras. 
To address these challenges, we propose CPGNet-LCF,\na new multi-modal fusion framework extending the LiDAR-only CPGNet. CPGNet-LCF\nsolves the first challenge by inheriting the easy deployment and real-time\ncapabilities of CPGNet. For the second challenge, we introduce a novel weak\ncalibration knowledge distillation strategy during training to improve the\nrobustness against the weak calibration. CPGNet-LCF achieves state-of-the-art\nperformance on the nuScenes and SemanticKITTI benchmarks. Remarkably, it can be\neasily deployed to run in 20ms per frame on a single Tesla V100 GPU using\nTensorRT TF16 mode. Furthermore, we benchmark performance over four weak\ncalibration levels, demonstrating the robustness of our proposed approach.\n","authors":["Feng Jiang","Chaoping Tu","Gang Zhang","Jun Li","Hanqing Huang","Junyu Lin","Di Feng","Jian Pu"],"pdf_url":"https://arxiv.org/pdf/2310.08826v1.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.08825v1","updated":"2023-10-13T02:41:55Z","published":"2023-10-13T02:41:55Z","title":"From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language\n Models","summary":" Multi-modal Large Language Models (MLLMs) have made significant strides in\nexpanding the capabilities of Large Language Models (LLMs) through the\nincorporation of visual perception interfaces. Despite the emergence of\nexciting applications and the availability of diverse instruction tuning data,\nexisting approaches often rely on CLIP or its variants as the visual branch,\nand merely extract features from the deep layers. However, these methods lack a\ncomprehensive analysis of the visual encoders in MLLMs. In this paper, we\nconduct an extensive investigation into the effectiveness of different vision\nencoders within MLLMs. Our findings reveal that the shallow layer features of\nCLIP offer particular advantages for fine-grained tasks such as grounding and\nregion understanding. Surprisingly, the vision-only model DINO, which is not\npretrained with text-image alignment, demonstrates promising performance as a\nvisual branch within MLLMs. By simply equipping it with an MLP layer for\nalignment, DINO surpasses CLIP in fine-grained related perception tasks.\nBuilding upon these observations, we propose a simple yet effective feature\nmerging strategy, named COMM, that integrates CLIP and DINO with Multi-level\nfeatures Merging, to enhance the visual capabilities of MLLMs. We evaluate COMM\nthrough comprehensive experiments on a wide range of benchmarks, including\nimage captioning, visual question answering, visual grounding, and object\nhallucination. Experimental results demonstrate the superior performance of\nCOMM compared to existing methods, showcasing its enhanced visual capabilities\nwithin MLLMs. Code will be made available at\nhttps://github.com/YuchenLiu98/COMM.\n","authors":["Dongsheng Jiang","Yuchen Liu","Songlin Liu","Xiaopeng Zhang","Jin Li","Hongkai Xiong","Qi Tian"],"pdf_url":"https://arxiv.org/pdf/2310.08825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.12966v3","updated":"2023-10-13T02:41:28Z","published":"2023-08-24T17:59:17Z","title":"Qwen-VL: A Versatile Vision-Language Model for Understanding,\n Localization, Text Reading, and Beyond","summary":" In this work, we introduce the Qwen-VL series, a set of large-scale\nvision-language models (LVLMs) designed to perceive and understand both texts\nand images. 
Starting from the Qwen-LM as a foundation, we endow it with visual\ncapacity by the meticulously designed (i) visual receptor, (ii) input-output\ninterface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal\ncleaned corpus. Beyond the conventional image description and\nquestion-answering, we implement the grounding and text-reading ability of\nQwen-VLs by aligning image-caption-box tuples. The resulting models, including\nQwen-VL and Qwen-VL-Chat, set new records for generalist models under similar\nmodel scales on a broad range of visual-centric benchmarks (e.g., image\ncaptioning, question answering, visual grounding) and different settings (e.g.,\nzero-shot, few-shot). Moreover, on real-world dialog benchmarks, our\ninstruction-tuned Qwen-VL-Chat also demonstrates superiority compared to\nexisting vision-language chatbots. Code, demo and models are available at\nhttps://github.com/QwenLM/Qwen-VL.\n","authors":["Jinze Bai","Shuai Bai","Shusheng Yang","Shijie Wang","Sinan Tan","Peng Wang","Junyang Lin","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2308.12966v3.pdf","comment":"Code, demo and models are available at\n https://github.com/QwenLM/Qwen-VL"},{"id":"http://arxiv.org/abs/2310.08820v1","updated":"2023-10-13T02:28:40Z","published":"2023-10-13T02:28:40Z","title":"SAM-guided Unsupervised Domain Adaptation for 3D Segmentation","summary":" Unsupervised domain adaptation (UDA) in 3D segmentation tasks presents a\nformidable challenge, primarily stemming from the sparse and unordered nature\nof point cloud data. Especially for LiDAR point clouds, the domain discrepancy\nbecomes obvious across varying capture scenes, fluctuating weather conditions,\nand the diverse array of LiDAR devices in use. While previous UDA methodologies\nhave often sought to mitigate this gap by aligning features between source and\ntarget domains, this approach falls short when applied to 3D segmentation due\nto the substantial domain variations. Inspired by the remarkable generalization\ncapabilities exhibited by the vision foundation model, SAM, in the realm of\nimage segmentation, our approach leverages the wealth of general knowledge\nembedded within SAM to unify feature representations across diverse 3D domains\nand further solves the 3D domain adaptation problem. Specifically, we harness\nthe corresponding images associated with point clouds to facilitate knowledge\ntransfer and propose an innovative hybrid feature augmentation methodology,\nwhich significantly enhances the alignment between the 3D feature space and\nSAM's feature space, operating at both the scene and instance levels. Our\nmethod is evaluated on many widely-recognized datasets and achieves\nstate-of-the-art performance.\n","authors":["Xidong Peng","Runnan Chen","Feng Qiao","Lingdong Kong","Youquan Liu","Tai Wang","Xinge Zhu","Yuexin Ma"],"pdf_url":"https://arxiv.org/pdf/2310.08820v1.pdf","comment":"submitted to ICLR 2024"},{"id":"http://arxiv.org/abs/2310.05873v3","updated":"2023-10-13T02:04:57Z","published":"2023-10-09T17:13:10Z","title":"Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion\n Models","summary":" Fine-tuning diffusion models through personalized datasets is an acknowledged\nmethod for improving generation quality across downstream tasks, which,\nhowever, often inadvertently generates unintended concepts such as watermarks\nand QR codes, attributed to the limitations in image sources and collecting\nmethods within specific downstream tasks. 
Existing solutions suffer from\neliminating these unintentionally learned implicit concepts, primarily due to\nthe dependency on the model's ability to recognize concepts that it actually\ncannot discern. In this work, we introduce Geom-Erasing, a novel approach that\nsuccessfully removes the implicit concepts with either an additional accessible\nclassifier or detector model to encode geometric information of these concepts\ninto text domain. Moreover, we propose Implicit Concept, a novel image-text\ndataset imbued with three implicit concepts (i.e., watermarks, QR codes, and\ntext) for training and evaluation. Experimental results demonstrate that\nGeom-Erasing not only identifies but also proficiently eradicates implicit\nconcepts, revealing a significant improvement over the existing methods. The\nintegration of geometric information marks a substantial progression in the\nprecise removal of implicit concepts in diffusion models.\n","authors":["Zhili Liu","Kai Chen","Yifan Zhang","Jianhua Han","Lanqing Hong","Hang Xu","Zhenguo Li","Dit-Yan Yeung","James Kwok"],"pdf_url":"https://arxiv.org/pdf/2310.05873v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08815v1","updated":"2023-10-13T01:59:39Z","published":"2023-10-13T01:59:39Z","title":"Incremental Object Detection with CLIP","summary":" In the incremental detection task, unlike the incremental classification\ntask, data ambiguity exists due to the possibility of an image having different\nlabeled bounding boxes in multiple continuous learning stages. This phenomenon\noften impairs the model's ability to learn new classes. However, the forward\ncompatibility of the model is less considered in existing work, which hinders\nthe model's suitability for incremental learning. To overcome this obstacle, we\npropose to use a language-visual model such as CLIP to generate text feature\nembeddings for different class sets, which enhances the feature space globally.\nWe then employ the broad classes to replace the unavailable novel classes in\nthe early learning stage to simulate the actual incremental scenario. Finally,\nwe use the CLIP image encoder to identify potential objects in the proposals,\nwhich are classified into the background by the model. We modify the background\nlabels of those proposals to known classes and add the boxes to the training\nset to alleviate the problem of data ambiguity. We evaluate our approach on\nvarious incremental learning settings on the PASCAL VOC 2007 dataset, and our\napproach outperforms state-of-the-art methods, particularly for the new\nclasses.\n","authors":["Yupeng He","Ziyue Huang","Qingjie Liu","Yunhong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08815v1.pdf","comment":"10 pages, 2 figures"},{"id":"http://arxiv.org/abs/2305.02034v4","updated":"2023-10-13T01:49:42Z","published":"2023-05-03T10:58:07Z","title":"SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment\n Anything Model","summary":" The success of the Segment Anything Model (SAM) demonstrates the significance\nof data-centric machine learning. However, due to the difficulties and high\ncosts associated with annotating Remote Sensing (RS) images, a large amount of\nvaluable RS data remains unlabeled, particularly at the pixel level. In this\nstudy, we leverage SAM and existing RS object detection datasets to develop an\nefficient pipeline for generating a large-scale RS segmentation dataset, dubbed\nSAMRS. 
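The incremental-detection entry above relies on CLIP text embeddings for different class sets to shape the feature space globally. A minimal sketch of that embedding step with the Hugging Face CLIP interface follows; the checkpoint name, prompt template, and class list are placeholders, and the full pipeline (broad classes, proposal relabeling) is not reproduced.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder class set; in an incremental setting this grows per learning stage.
class_names = ["person", "car", "dog", "bicycle"]
prompts = [f"a photo of a {c}" for c in class_names]

with torch.no_grad():
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**inputs)            # (num_classes, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# A proposal's image feature (e.g. from the CLIP image encoder) can then be
# scored against every class set seen so far via cosine similarity.
proposal_feat = torch.randn(1, text_emb.shape[-1])
proposal_feat = proposal_feat / proposal_feat.norm(dim=-1, keepdim=True)
scores = proposal_feat @ text_emb.T
print(scores.shape)  # torch.Size([1, 4])
```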
SAMRS totally possesses 105,090 images and 1,668,241 instances,\nsurpassing existing high-resolution RS segmentation datasets in size by several\norders of magnitude. It provides object category, location, and instance\ninformation that can be used for semantic segmentation, instance segmentation,\nand object detection, either individually or in combination. We also provide a\ncomprehensive analysis of SAMRS from various aspects. Moreover, preliminary\nexperiments highlight the importance of conducting segmentation pre-training\nwith SAMRS to address task discrepancies and alleviate the limitations posed by\nlimited training data during fine-tuning. The code and dataset will be\navailable at https://github.com/ViTAE-Transformer/SAMRS.\n","authors":["Di Wang","Jing Zhang","Bo Du","Minqiang Xu","Lin Liu","Dacheng Tao","Liangpei Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.02034v4.pdf","comment":"Accepted by NeurIPS 2023 Datasets and Benchmarks Track"},{"id":"http://arxiv.org/abs/2303.08320v4","updated":"2023-10-13T01:43:04Z","published":"2023-03-15T02:16:39Z","title":"VideoFusion: Decomposed Diffusion Models for High-Quality Video\n Generation","summary":" A diffusion probabilistic model (DPM), which constructs a forward diffusion\nprocess by gradually adding noise to data points and learns the reverse\ndenoising process to generate new samples, has been shown to handle complex\ndata distribution. Despite its recent success in image synthesis, applying DPMs\nto video generation is still challenging due to high-dimensional data spaces.\nPrevious methods usually adopt a standard diffusion process, where frames in\nthe same video clip are destroyed with independent noises, ignoring the content\nredundancy and temporal correlation. This work presents a decomposed diffusion\nprocess via resolving the per-frame noise into a base noise that is shared\namong all frames and a residual noise that varies along the time axis. The\ndenoising pipeline employs two jointly-learned networks to match the noise\ndecomposition accordingly. Experiments on various datasets confirm that our\napproach, termed as VideoFusion, surpasses both GAN-based and diffusion-based\nalternatives in high-quality video generation. We further show that our\ndecomposed formulation can benefit from pre-trained image diffusion models and\nwell-support text-conditioned video creation.\n","authors":["Zhengxiong Luo","Dayou Chen","Yingya Zhang","Yan Huang","Liang Wang","Yujun Shen","Deli Zhao","Jingren Zhou","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2303.08320v4.pdf","comment":"Accepted to CVPR2023"},{"id":"http://arxiv.org/abs/2310.08805v1","updated":"2023-10-13T01:27:36Z","published":"2023-10-13T01:27:36Z","title":"Two-Stage Deep Learning Framework for Quality Assessment of Left Atrial\n Late Gadolinium Enhanced MRI Images","summary":" Accurate assessment of left atrial fibrosis in patients with atrial\nfibrillation relies on high-quality 3D late gadolinium enhancement (LGE) MRI\nimages. However, obtaining such images is challenging due to patient motion,\nchanging breathing patterns, or sub-optimal choice of pulse sequence\nparameters. Automated assessment of LGE-MRI image diagnostic quality is\nclinically significant as it would enhance diagnostic accuracy, improve\nefficiency, ensure standardization, and contributes to better patient outcomes\nby providing reliable and high-quality LGE-MRI scans for fibrosis\nquantification and treatment planning. 
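The VideoFusion entry above decomposes per-frame noise into a base component shared by all frames plus a time-varying residual. A minimal sketch of that noise construction is below; the mixing weight `lambda_base` and the tensor shapes are illustrative assumptions, and the two jointly learned denoising networks are omitted.

```python
import torch

def decomposed_noise(batch, frames, channels, height, width, lambda_base=0.5):
    """Per-frame noise = shared base noise + time-varying residual noise.

    Scaling by sqrt(lambda) and sqrt(1 - lambda) keeps the mixture unit-variance
    when base and residual are independent standard Gaussians.
    """
    base = torch.randn(batch, 1, channels, height, width)            # shared across frames
    residual = torch.randn(batch, frames, channels, height, width)   # varies along time
    return (lambda_base ** 0.5) * base + ((1 - lambda_base) ** 0.5) * residual

noise = decomposed_noise(batch=2, frames=16, channels=3, height=32, width=32)
print(noise.shape)  # torch.Size([2, 16, 3, 32, 32])
```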
To address this, we propose a two-stage\ndeep-learning approach for automated LGE-MRI image diagnostic quality\nassessment. The method includes a left atrium detector to focus on relevant\nregions and a deep network to evaluate diagnostic quality. We explore two\ntraining strategies, multi-task learning, and pretraining using contrastive\nlearning, to overcome limited annotated data in medical imaging. Contrastive\nLearning result shows about $4\\%$, and $9\\%$ improvement in F1-Score and\nSpecificity compared to Multi-Task learning when there's limited data.\n","authors":["K M Arefeen Sultan","Benjamin Orkild","Alan Morris","Eugene Kholmovski","Erik Bieging","Eugene Kwan","Ravi Ranjan","Ed DiBella","Shireen Elhabian"],"pdf_url":"https://arxiv.org/pdf/2310.08805v1.pdf","comment":"Accepted to STACOM 2023. 11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.08475v2","updated":"2023-10-13T01:12:25Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights. Code and\ndataset are available in https://github.com/zjunlp/EasyEdit.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v2.pdf","comment":"EMNLP 2023"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2303.00807v3","updated":"2023-10-13T17:23:04Z","published":"2023-03-01T20:21:23Z","title":"UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and\n Distillation of Rerankers","summary":" Many information retrieval tasks require large labeled datasets for\nfine-tuning. However, such datasets are often unavailable, and their utility\nfor real-world applications can diminish quickly due to domain shifts. To\naddress this challenge, we develop and motivate a method for using large\nlanguage models (LLMs) to generate large numbers of synthetic queries cheaply.\nThe method begins by generating a small number of synthetic queries using an\nexpensive LLM. After that, a much less expensive one is used to create large\nnumbers of synthetic queries, which are used to fine-tune a family of reranker\nmodels. These rerankers are then distilled into a single efficient retriever\nfor use in the target domain. 
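The LGE-MRI quality-assessment entry above compares multi-task learning against contrastive pretraining when labels are scarce. Below is a minimal SimCLR-style NT-Xent loss, the kind of objective such contrastive pretraining commonly uses; it is a generic illustration under assumed batch size and temperature, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss for two augmented views z1, z2 of shape (N, D)."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = z @ z.T / temperature                          # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    # The positive for row i is its other view: i+N (first half) or i-N (second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage with random stand-ins for embeddings of two augmented views.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent(z1, z2).item())
```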
We show that this technique boosts zero-shot\naccuracy in long-tail domains and achieves substantially lower latency than\nstandard reranking methods.\n","authors":["Jon Saad-Falcon","Omar Khattab","Keshav Santhanam","Radu Florian","Martin Franz","Salim Roukos","Avirup Sil","Md Arafat Sultan","Christopher Potts"],"pdf_url":"https://arxiv.org/pdf/2303.00807v3.pdf","comment":"Long Paper at Empirical Methods in Natural Language Processing\n (EMNLP) 2023"},{"id":"http://arxiv.org/abs/2310.09234v1","updated":"2023-10-13T16:37:53Z","published":"2023-10-13T16:37:53Z","title":"ClickPrompt: CTR Models are Strong Prompt Generators for Adapting\n Language Models to CTR Prediction","summary":" Click-through rate (CTR) prediction has become increasingly indispensable for\nvarious Internet applications. Traditional CTR models convert the multi-field\ncategorical data into ID features via one-hot encoding, and extract the\ncollaborative signals among features. Such a paradigm suffers from the problem\nof semantic information loss. Another line of research explores the potential\nof pretrained language models (PLMs) for CTR prediction by converting input\ndata into textual sentences through hard prompt templates. Although semantic\nsignals are preserved, they generally fail to capture the collaborative\ninformation (e.g., feature interactions, pure ID features), not to mention the\nunacceptable inference overhead brought by the huge model size. In this paper,\nwe aim to model both the semantic knowledge and collaborative knowledge for\naccurate CTR estimation, and meanwhile address the inference inefficiency\nissue. To benefit from both worlds and close their gaps, we propose a novel\nmodel-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models\nto generate interaction-aware soft prompts for PLMs. We design a\nprompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM\nhas to recover the masked tokens based on the language context, as well as the\nsoft prompts generated by CTR model. The collaborative and semantic knowledge\nfrom ID and textual features would be explicitly aligned and interacted via the\nprompt interface. Then, we can either tune the CTR model with PLM for superior\nperformance, or solely tune the CTR model without PLM for inference efficiency.\nExperiments on four real-world datasets validate the effectiveness of\nClickPrompt compared with existing baselines.\n","authors":["Jianghao Lin","Bo Chen","Hangyu Wang","Yunjia Xi","Yanru Qu","Xinyi Dai","Kangning Zhang","Ruiming Tang","Yong Yu","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.09234v1.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2310.09233v1","updated":"2023-10-13T16:37:14Z","published":"2023-10-13T16:37:14Z","title":"AgentCF: Collaborative Learning with Autonomous Language Agents for\n Recommender Systems","summary":" Recently, there has been an emergence of employing LLM-powered agents as\nbelievable human proxies, based on their remarkable decision-making capability.\nHowever, existing studies mainly focus on simulating human dialogue. Human\nnon-verbal behaviors, such as item clicking in recommender systems, although\nimplicitly exhibiting user preferences and could enhance the modeling of users,\nhave not been deeply explored. 
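The ClickPrompt entry above has a CTR model generate interaction-aware soft prompts that are prepended to the PLM's token embeddings. The sketch below shows one plausible form of that prompt interface; the dimensions, the single linear projection, and the random stand-ins for the CTR model and PLM embeddings are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SoftPromptBridge(nn.Module):
    """Map a CTR model's representation to k soft-prompt vectors that are
    prepended to the language model's token embeddings."""

    def __init__(self, ctr_dim=64, plm_dim=768, num_prompts=4):
        super().__init__()
        self.num_prompts = num_prompts
        self.proj = nn.Linear(ctr_dim, num_prompts * plm_dim)

    def forward(self, ctr_repr, token_embeds):
        # ctr_repr: (B, ctr_dim); token_embeds: (B, T, plm_dim)
        b = ctr_repr.shape[0]
        prompts = self.proj(ctr_repr).view(b, self.num_prompts, -1)
        return torch.cat([prompts, token_embeds], dim=1)   # (B, k+T, plm_dim)

bridge = SoftPromptBridge()
ctr_repr = torch.randn(2, 64)            # stand-in for the CTR model output
token_embeds = torch.randn(2, 16, 768)   # stand-in for PLM input embeddings
print(bridge(ctr_repr, token_embeds).shape)  # torch.Size([2, 20, 768])
```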
The main reasons lie in the gap between language\nmodeling and behavior modeling, as well as the incomprehension of LLMs about\nuser-item relations.\n To address this issue, we propose AgentCF for simulating user-item\ninteractions in recommender systems through agent-based collaborative\nfiltering. We creatively consider not only users but also items as agents, and\ndevelop a collaborative learning approach that optimizes both kinds of agents\ntogether. Specifically, at each time step, we first prompt the user and item\nagents to interact autonomously. Then, based on the disparities between the\nagents' decisions and real-world interaction records, user and item agents are\nprompted to reflect on and adjust the misleading simulations collaboratively,\nthereby modeling their two-sided relations. The optimized agents can also\npropagate their preferences to other agents in subsequent interactions,\nimplicitly capturing the collaborative filtering idea. Overall, the optimized\nagents exhibit diverse interaction behaviors within our framework, including\nuser-item, user-user, item-item, and collective interactions. The results show\nthat these agents can demonstrate personalized behaviors akin to those of\nreal-world individuals, sparking the development of next-generation user\nbehavior simulation.\n","authors":["Junjie Zhang","Yupeng Hou","Ruobing Xie","Wenqi Sun","Julian McAuley","Wayne Xin Zhao","Leyu Lin","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2310.09233v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.11131v2","updated":"2023-10-13T16:13:52Z","published":"2023-08-22T02:25:04Z","title":"ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential\n Behavior Comprehension in Recommendation","summary":" With large language models (LLMs) achieving remarkable breakthroughs in\nnatural language processing (NLP) domains, LLM-enhanced recommender systems\nhave received much attention and have been actively explored currently. In this\npaper, we focus on adapting and empowering a pure large language model for\nzero-shot and few-shot recommendation tasks. First and foremost, we identify\nand formulate the lifelong sequential behavior incomprehension problem for LLMs\nin recommendation domains, i.e., LLMs fail to extract useful information from a\ntextual context of long user behavior sequence, even if the length of context\nis far from reaching the context limitation of LLMs. To address such an issue\nand improve the recommendation performance of LLMs, we propose a novel\nframework, namely Retrieval-enhanced Large Language models (ReLLa) for\nrecommendation tasks in both zero-shot and few-shot settings. For zero-shot\nrecommendation, we perform semantic user behavior retrieval (SUBR) to improve\nthe data quality of testing samples, which greatly reduces the difficulty for\nLLMs to extract the essential knowledge from user behavior sequences. As for\nfew-shot recommendation, we further design retrieval-enhanced instruction\ntuning (ReiT) by adopting SUBR as a data augmentation technique for training\nsamples. Specifically, we develop a mixed training dataset consisting of both\nthe original data samples and their retrieval-enhanced counterparts. We conduct\nextensive experiments on three real-world public datasets to demonstrate the\nsuperiority of ReLLa compared with existing baseline models, as well as its\ncapability for lifelong sequential behavior comprehension. 
To be highlighted,\nwith only less than 10% training samples, few-shot ReLLa can outperform\ntraditional CTR models that are trained on the entire training set (e.g.,\nDCNv2, DIN, SIM).\n","authors":["Jianghao Lin","Rong Shan","Chenxu Zhu","Kounianhua Du","Bo Chen","Shigang Quan","Ruiming Tang","Yong Yu","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.11131v2.pdf","comment":"Updated Version. Few-shot ReLLa is now able to outperform full-shot\n CTR models trained on the entire training set"},{"id":"http://arxiv.org/abs/2310.08891v1","updated":"2023-10-13T06:53:02Z","published":"2023-10-13T06:53:02Z","title":"EHI: End-to-end Learning of Hierarchical Index for Efficient Dense\n Retrieval","summary":" Dense embedding-based retrieval is now the industry standard for semantic\nsearch and ranking problems, like obtaining relevant web documents for a given\nquery. Such techniques use a two-stage process: (a) contrastive learning to\ntrain a dual encoder to embed both the query and documents and (b) approximate\nnearest neighbor search (ANNS) for finding similar documents for a given query.\nThese two stages are disjoint; the learned embeddings might be ill-suited for\nthe ANNS method and vice-versa, leading to suboptimal performance. In this\nwork, we propose End-to-end Hierarchical Indexing -- EHI -- that jointly learns\nboth the embeddings and the ANNS structure to optimize retrieval performance.\nEHI uses a standard dual encoder model for embedding queries and documents\nwhile learning an inverted file index (IVF) style tree structure for efficient\nANNS. To ensure stable and efficient learning of discrete tree-based ANNS\nstructure, EHI introduces the notion of dense path embedding that captures the\nposition of a query/document in the tree. We demonstrate the effectiveness of\nEHI on several benchmarks, including de-facto industry standard MS MARCO (Dev\nset and TREC DL19) datasets. For example, with the same compute budget, EHI\noutperforms state-of-the-art (SOTA) in by 0.6% (MRR@10) on MS MARCO dev set and\nby 4.2% (nDCG@10) on TREC DL19 benchmarks.\n","authors":["Ramnath Kumar","Anshul Mittal","Nilesh Gupta","Aditya Kusupati","Inderjit Dhillon","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2310.08891v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09401v1","updated":"2023-10-13T20:53:50Z","published":"2023-10-13T20:53:50Z","title":"CIDER: Category-Guided Intent Disentanglement for Accurate Personalized\n News Recommendation","summary":" Personalized news recommendation aims to assist users in finding news\narticles that align with their interests, which plays a pivotal role in\nmitigating users' information overload problem. Although many recent works have\nbeen studied for better user and news representations, the following challenges\nhave been rarely studied: (C1) How to precisely comprehend a range of intents\ncoupled within a news article? and (C2) How to differentiate news articles with\nvarying post-read preferences in users' click history? To tackle both\nchallenges together, in this paper, we propose a novel personalized news\nrecommendation framework (CIDER) that employs (1) category-guided intent\ndisentanglement for (C1) and (2) consistency-based news representation for\n(C2). Furthermore, we incorporate a category prediction into the training\nprocess of CIDER as an auxiliary task, which provides supplementary supervisory\nsignals to enhance intent disentanglement. 
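The EHI entry above contrasts its end-to-end approach with the standard disjoint pipeline: a dual encoder produces embeddings, and an IVF-style ANNS index is built separately. The sketch below shows that conventional two-stage baseline with FAISS, not EHI itself; the embedding dimension, list count, probe setting, and random vectors are placeholders.

```python
import numpy as np
import faiss

d, n_docs, n_lists = 128, 10000, 64

# Stage (a): embeddings from a dual encoder (random stand-ins here).
doc_emb = np.random.randn(n_docs, d).astype("float32")
query_emb = np.random.randn(5, d).astype("float32")
faiss.normalize_L2(doc_emb)
faiss.normalize_L2(query_emb)

# Stage (b): an IVF index trained separately from the encoder.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, n_lists, faiss.METRIC_INNER_PRODUCT)
index.train(doc_emb)
index.add(doc_emb)
index.nprobe = 8                     # clusters probed per query

scores, ids = index.search(query_emb, 10)
print(ids.shape)  # (5, 10)
```

The disjointness is visible here: nothing in stage (b) feeds back into the encoder of stage (a), which is the gap the abstract's joint training is meant to close.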
Extensive experiments on two\nreal-world datasets reveal that (1) CIDER provides consistent performance\nimprovements over seven state-of-the-art news recommendation methods and (2)\nthe proposed strategies significantly improve the model accuracy of CIDER.\n","authors":["Yunyong Ko","Seongeun Ryu","Sang-Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2310.09401v1.pdf","comment":"8 pages, 6 figures, 6 tables"},{"id":"http://arxiv.org/abs/2310.09400v1","updated":"2023-10-13T20:52:18Z","published":"2023-10-13T20:52:18Z","title":"Collaborative Contextualization: Bridging the Gap between Collaborative\n Filtering and Pre-trained Language Model","summary":" Traditional recommender systems have heavily relied on identity\nrepresentations (IDs) to model users and items, while the ascendancy of\npre-trained language model (PLM) encoders has enriched the modeling of\ncontextual item descriptions. However, PLMs, although effective in addressing\nfew-shot, zero-shot, or unified modeling scenarios, often neglect the crucial\ncollaborative filtering signal. This neglect gives rise to two pressing\nchallenges: (1) Collaborative Contextualization, the seamless integration of\ncollaborative signals with contextual representations. (2) the imperative to\nbridge the representation gap between ID-based representations and contextual\nrepresentations while preserving their contextual semantics. In this paper, we\npropose CollabContext, a novel model that adeptly combines collaborative\nfiltering signals with contextual representations and aligns these\nrepresentations within the contextual space, preserving essential contextual\nsemantics. Experimental results across three real-world datasets demonstrate\nsubstantial improvements. Leveraging collaborative contextualization,\nCollabContext can also be effectively applied to cold-start scenarios,\nachieving remarkable enhancements in recommendation performance. The code is\navailable after the conference accepts the paper.\n","authors":["Chen Wang","Liangwei Yang","Zhiwei Liu","Xiaolong Liu","Mingdai Yang","Yueqing Liang","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2310.09400v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09341v1","updated":"2023-10-13T18:11:12Z","published":"2023-10-13T18:11:12Z","title":"Addressing the cold start problem in privacy preserving content-based\n recommender systems using hypercube graphs","summary":" The initial interaction of a user with a recommender system is problematic\nbecause, in such a so-called cold start situation, the recommender system has\nvery little information about the user, if any. Moreover, in collaborative\nfiltering, users need to share their preferences with the service provider by\nrating items while in content-based filtering there is no need for such\ninformation sharing. We have recently shown that a content-based model that\nuses hypercube graphs can determine user preferences with a very limited number\nof ratings while better preserving user privacy. In this paper, we confirm\nthese findings on the basis of experiments with more than 1,000 users in the\nrestaurant and movie domains. We show that the proposed method outperforms\nstandard machine learning algorithms when the number of available ratings is at\nmost 10, which often happens, and is competitive with larger training sets. 
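The CollabContext entry above combines collaborative-filtering ID embeddings with PLM contextual item representations and aligns them in the contextual space. Below is a minimal sketch of one plausible alignment step: project the ID embedding and penalize the cosine gap to a frozen contextual encoding. The projection and loss choice are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDToContext(nn.Module):
    """Project collaborative ID embeddings into the PLM's contextual space."""

    def __init__(self, num_items=1000, id_dim=64, ctx_dim=768):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, id_dim)
        self.proj = nn.Linear(id_dim, ctx_dim)

    def forward(self, item_ids):
        return self.proj(self.id_emb(item_ids))

def alignment_loss(projected_id, ctx_repr):
    # Encourage the projected ID embedding to match the contextual one.
    return 1.0 - F.cosine_similarity(projected_id, ctx_repr, dim=-1).mean()

model = IDToContext()
item_ids = torch.randint(0, 1000, (8,))
ctx_repr = torch.randn(8, 768)           # stand-in for PLM item encodings
print(alignment_loss(model(item_ids), ctx_repr).item())
```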
In\naddition, training is simple and does not require large computational efforts.\n","authors":["Noa Tuval","Alain Hertz","Tsvi Kuflik"],"pdf_url":"https://arxiv.org/pdf/2310.09341v1.pdf","comment":"22 pages, 6 figures, 9 tables"},{"id":"http://arxiv.org/abs/2309.04761v2","updated":"2023-10-13T11:18:13Z","published":"2023-09-09T11:20:40Z","title":"A Comprehensive Survey on Deep Learning Techniques in Educational Data\n Mining","summary":" Educational Data Mining (EDM) has emerged as a vital field of research, which\nharnesses the power of computational techniques to analyze educational data.\nWith the increasing complexity and diversity of educational data, Deep Learning\ntechniques have shown significant advantages in addressing the challenges\nassociated with analyzing and modeling this data. This survey aims to\nsystematically review the state-of-the-art in EDM with Deep Learning. We begin\nby providing a brief introduction to EDM and Deep Learning, highlighting their\nrelevance in the context of modern education. Next, we present a detailed\nreview of Deep Learning techniques applied in four typical educational\nscenarios, including knowledge tracing, undesirable student detecting,\nperformance prediction, and personalized recommendation. Furthermore, a\ncomprehensive overview of public datasets and processing tools for EDM is\nprovided. Finally, we point out emerging trends and future directions in this\nresearch area.\n","authors":["Yuanguo Lin","Hong Chen","Wei Xia","Fan Lin","Pengcheng Wu","Zongyue Wang","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2309.04761v2.pdf","comment":"21 pages, 5 figures"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2306.07261v4","updated":"2023-10-13T17:53:38Z","published":"2023-06-12T17:44:15Z","title":"Unprocessing Seven Years of Algorithmic Fairness","summary":" Seven years ago, researchers proposed a postprocessing method to equalize the\nerror rates of a model across different demographic groups. The work launched\nhundreds of papers purporting to improve over the postprocessing baseline. We\nempirically evaluate these claims through thousands of model evaluations on\nseveral tabular datasets. We find that the fairness-accuracy Pareto frontier\nachieved by postprocessing contains all other methods we were feasibly able to\nevaluate. In doing so, we address two common methodological errors that have\nconfounded previous observations. One relates to the comparison of methods with\ndifferent unconstrained base models. The other concerns methods achieving\ndifferent levels of constraint relaxation. At the heart of our study is a\nsimple idea we call unprocessing that roughly corresponds to the inverse of\npostprocessing. Unprocessing allows for a direct comparison of methods using\ndifferent underlying models and levels of relaxation.\n","authors":["André F. Cruz","Moritz Hardt"],"pdf_url":"https://arxiv.org/pdf/2306.07261v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09278v1","updated":"2023-10-13T17:40:39Z","published":"2023-10-13T17:40:39Z","title":"Disentangled Latent Spaces Facilitate Data-Driven Auxiliary Learning","summary":" In deep learning, auxiliary objectives are often used to facilitate learning\nin situations where data is scarce, or the principal task is extremely complex.\nThis idea is primarily inspired by the improved generalization capability\ninduced by solving multiple tasks simultaneously, which leads to a more robust\nshared representation. 
Nevertheless, finding optimal auxiliary tasks that give\nrise to the desired improvement is a crucial problem that often requires\nhand-crafted solutions or expensive meta-learning approaches. In this paper, we\npropose a novel framework, dubbed Detaux, whereby a weakly supervised\ndisentanglement procedure is used to discover new unrelated classification\ntasks and the associated labels that can be exploited with the principal task\nin any Multi-Task Learning (MTL) model. The disentanglement procedure works at\na representation level, isolating a subspace related to the principal task,\nplus an arbitrary number of orthogonal subspaces. In the most disentangled\nsubspaces, through a clustering procedure, we generate the additional\nclassification tasks, and the associated labels become their representatives.\nSubsequently, the original data, the labels associated with the principal task,\nand the newly discovered ones can be fed into any MTL framework. Extensive\nvalidation on both synthetic and real data, along with various ablation\nstudies, demonstrate promising results, revealing the potential in what has\nbeen, so far, an unexplored connection between learning disentangled\nrepresentations and MTL. The code will be made publicly available upon\nacceptance.\n","authors":["Geri Skenderi","Luigi Capogrosso","Andrea Toaiari","Matteo Denitto","Franco Fummi","Simone Melzi","Marco Cristani"],"pdf_url":"https://arxiv.org/pdf/2310.09278v1.pdf","comment":"Under review in Pattern Recognition Letters"},{"id":"http://arxiv.org/abs/2310.09277v1","updated":"2023-10-13T17:39:35Z","published":"2023-10-13T17:39:35Z","title":"A Hybrid Approach for Depression Classification: Random Forest-ANN\n Ensemble on Motor Activity Signals","summary":" Regarding the rising number of people suffering from mental health illnesses\nin today's society, the importance of mental health cannot be overstated.\nWearable sensors, which are increasingly widely available, provide a potential\nway to track and comprehend mental health issues. These gadgets not only\nmonitor everyday activities but also continuously record vital signs like heart\nrate, perhaps providing information on a person's mental state. Recent research\nhas used these sensors in conjunction with machine learning methods to identify\npatterns relating to different mental health conditions, highlighting the\nimmense potential of this data beyond simple activity monitoring. In this\nresearch, we present a novel algorithm called the Hybrid Random forest - Neural\nnetwork that has been tailored to evaluate sensor data from depressed patients.\nOur method has a noteworthy accuracy of 80\\% when evaluated on a special\ndataset that included both unipolar and bipolar depressive patients as well as\nhealthy controls. The findings highlight the algorithm's potential for reliably\ndetermining a person's depression condition using sensor data, making a\nsubstantial contribution to the area of mental health diagnostics.\n","authors":["Anket Patil","Dhairya Shah","Abhishek Shah","Mokshit Gala"],"pdf_url":"https://arxiv.org/pdf/2310.09277v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2310.09270v1","updated":"2023-10-13T17:35:04Z","published":"2023-10-13T17:35:04Z","title":"Retro-fallback: retrosynthetic planning in an uncertain world","summary":" Retrosynthesis is the task of proposing a series of chemical reactions to\ncreate a desired molecule from simpler, buyable molecules. 
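The Detaux entry above generates auxiliary classification tasks by clustering samples inside disentangled subspaces orthogonal to the principal task. The sketch below shows only that labeling step with k-means on an assumed subspace split; the disentanglement procedure itself, the dimensions, and the cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 16))     # stand-in for learned representations

# Assume the last 8 dimensions form a subspace orthogonal to the principal task.
aux_subspace = latent[:, 8:]

# Cluster in that subspace; cluster indices become auxiliary labels that can be
# fed to any multi-task learning framework alongside the principal-task labels.
aux_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(aux_subspace)
print(np.bincount(aux_labels))
```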
While previous works\nhave proposed algorithms to find optimal solutions for a range of metrics (e.g.\nshortest, lowest-cost), these works generally overlook the fact that we have\nimperfect knowledge of the space of possible reactions, meaning plans created\nby the algorithm may not work in a laboratory. In this paper we propose a novel\nformulation of retrosynthesis in terms of stochastic processes to account for\nthis uncertainty. We then propose a novel greedy algorithm called\nretro-fallback which maximizes the probability that at least one synthesis plan\ncan be executed in the lab. Using in-silico benchmarks we demonstrate that\nretro-fallback generally produces better sets of synthesis plans than the\npopular MCTS and retro* algorithms.\n","authors":["Austin Tripp","Krzysztof Maziarz","Sarah Lewis","Marwin Segler","José Miguel Hernández-Lobato"],"pdf_url":"https://arxiv.org/pdf/2310.09270v1.pdf","comment":"39 pages (including appendices). Currently undergoing peer review"},{"id":"http://arxiv.org/abs/2310.09267v1","updated":"2023-10-13T17:25:11Z","published":"2023-10-13T17:25:11Z","title":"Genetic algorithms are strong baselines for molecule generation","summary":" Generating molecules, both in a directed and undirected fashion, is a huge\npart of the drug discovery pipeline. Genetic algorithms (GAs) generate\nmolecules by randomly modifying known molecules. In this paper we show that GAs\nare very strong algorithms for such tasks, outperforming many complicated\nmachine learning methods: a result which many researchers may find surprising.\nWe therefore propose insisting during peer review that new algorithms must have\nsome clear advantage over GAs, which we call the GA criterion. Ultimately our\nwork suggests that a lot of research in molecule generation should be\nre-assessed.\n","authors":["Austin Tripp","José Miguel Hernández-Lobato"],"pdf_url":"https://arxiv.org/pdf/2310.09267v1.pdf","comment":"Currently under review. Code will be made available at a later date"},{"id":"http://arxiv.org/abs/2310.09266v1","updated":"2023-10-13T17:24:52Z","published":"2023-10-13T17:24:52Z","title":"User Inference Attacks on Large Language Models","summary":" Fine-tuning is a common and effective method for tailoring large language\nmodels (LLMs) to specialized tasks and applications. In this paper, we study\nthe privacy implications of fine-tuning LLMs on user data. To this end, we\ndefine a realistic threat model, called user inference, wherein an attacker\ninfers whether or not a user's data was used for fine-tuning. We implement\nattacks for this threat model that require only a small set of samples from a\nuser (possibly different from the samples used for training) and black-box\naccess to the fine-tuned LLM. We find that LLMs are susceptible to user\ninference attacks across a variety of fine-tuning datasets, at times with near\nperfect attack success rates. Further, we investigate which properties make\nusers vulnerable to user inference, finding that outlier users (i.e. those with\ndata distributions sufficiently different from other users) and users who\ncontribute large quantities of data are most susceptible to attack. Finally, we\nexplore several heuristics for mitigating privacy attacks. We find that\ninterventions in the training algorithm, such as batch or per-example gradient\nclipping and early stopping fail to prevent user inference. 
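The retro-fallback entry above greedily builds a set of synthesis plans to maximize the probability that at least one can be executed. The toy sketch below conveys only that objective under a strong simplifying assumption of independent per-plan success probabilities (the paper instead models feasibility with stochastic processes); the plan names and probabilities are made up.

```python
def prob_at_least_one(probs):
    """P(at least one plan succeeds), assuming independent plans."""
    p_all_fail = 1.0
    for p in probs:
        p_all_fail *= (1.0 - p)
    return 1.0 - p_all_fail

candidate_plans = {"plan_a": 0.6, "plan_b": 0.5, "plan_c": 0.45, "plan_d": 0.2}

def greedy_select(plans, budget=2):
    """Greedily add the plan that most increases the success probability."""
    chosen = []
    for _ in range(budget):
        best = max((p for p in plans if p not in chosen),
                   key=lambda p: prob_at_least_one([plans[q] for q in chosen + [p]]))
        chosen.append(best)
    return chosen

picked = greedy_select(candidate_plans, budget=2)
print(picked, prob_at_least_one([candidate_plans[p] for p in picked]))
```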
However, limiting\nthe number of fine-tuning samples from a single user can reduce attack\neffectiveness, albeit at the cost of reducing the total amount of fine-tuning\ndata.\n","authors":["Nikhil Kandpal","Krishna Pillutla","Alina Oprea","Peter Kairouz","Christopher A. Choquette-Choo","Zheng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.09266v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09265v1","updated":"2023-10-13T17:23:17Z","published":"2023-10-13T17:23:17Z","title":"PromptRE: Weakly-Supervised Document-Level Relation Extraction via\n Prompting-Based Data Programming","summary":" Relation extraction aims to classify the relationships between two entities\ninto pre-defined categories. While previous research has mainly focused on\nsentence-level relation extraction, recent studies have expanded the scope to\ndocument-level relation extraction. Traditional relation extraction methods\nheavily rely on human-annotated training data, which is time-consuming and\nlabor-intensive. To mitigate the need for manual annotation, recent\nweakly-supervised approaches have been developed for sentence-level relation\nextraction while limited work has been done on document-level relation\nextraction. Weakly-supervised document-level relation extraction faces\nsignificant challenges due to an imbalanced number \"no relation\" instances and\nthe failure of directly probing pretrained large language models for document\nrelation extraction. To address these challenges, we propose PromptRE, a novel\nweakly-supervised document-level relation extraction method that combines\nprompting-based techniques with data programming. Furthermore, PromptRE\nincorporates the label distribution and entity types as prior knowledge to\nimprove the performance. By leveraging the strengths of both prompting and data\nprogramming, PromptRE achieves improved performance in relation classification\nand effectively handles the \"no relation\" problem. Experimental results on\nReDocRED, a benchmark dataset for document-level relation extraction,\ndemonstrate the superiority of PromptRE over baseline approaches.\n","authors":["Chufan Gao","Xulin Fan","Jimeng Sun","Xuan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.09265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.12127v3","updated":"2023-10-13T17:21:46Z","published":"2022-09-25T02:56:01Z","title":"SpeedLimit: Neural Architecture Search for Quantized Transformer Models","summary":" While research in the field of transformer models has primarily focused on\nenhancing performance metrics such as accuracy and perplexity, practical\napplications in industry often necessitate a rigorous consideration of\ninference latency constraints. Addressing this challenge, we introduce\nSpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes\naccuracy whilst adhering to an upper-bound latency constraint. Our method\nincorporates 8-bit integer quantization in the search process to outperform the\ncurrent state-of-the-art technique. Our results underline the feasibility and\nefficacy of seeking an optimal balance between performance and latency,\nproviding new avenues for deploying state-of-the-art transformer models in\nlatency-sensitive environments.\n","authors":["Yuji Chai","Luke Bailey","Yunho Jin","Matthew Karle","Glenn G. Ko","David Brooks","Gu-Yeon Wei","H. T. 
Kung"],"pdf_url":"https://arxiv.org/pdf/2209.12127v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09259v1","updated":"2023-10-13T17:15:05Z","published":"2023-10-13T17:15:05Z","title":"Towards End-to-end 4-Bit Inference on Generative Large Language Models","summary":" We show that the majority of the inference computations for large generative\nmodels such as LLaMA and OPT can be performed with both weights and activations\nbeing cast to 4 bits, in a way that leads to practical speedups while at the\nsame time maintaining good accuracy. We achieve this via a hybrid quantization\nstrategy called QUIK, which compresses most of the weights and activations to\n4-bit, while keeping some outlier weights and activations in higher-precision.\nCrucially, our scheme is designed with computational efficiency in mind: we\nprovide GPU kernels with highly-efficient layer-wise runtimes, which lead to\npractical end-to-end throughput improvements of up to 3.1x relative to FP16\nexecution. Code and models are provided at https://github.com/IST-DASLab/QUIK.\n","authors":["Saleh Ashkboos","Ilia Markov","Elias Frantar","Tingxuan Zhong","Xincheng Wang","Jie Ren","Torsten Hoefler","Dan Alistarh"],"pdf_url":"https://arxiv.org/pdf/2310.09259v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2310.09254v1","updated":"2023-10-13T17:12:04Z","published":"2023-10-13T17:12:04Z","title":"Generative Entropic Neural Optimal Transport To Map Within and Across\n Spaces","summary":" Learning measure-to-measure mappings is a crucial task in machine learning,\nfeatured prominently in generative modeling. Recent years have witnessed a\nsurge of techniques that draw inspiration from optimal transport (OT) theory.\nCombined with neural network models, these methods collectively known as\n\\textit{Neural OT} use optimal transport as an inductive bias: such mappings\nshould be optimal w.r.t. a given cost function, in the sense that they are able\nto move points in a thrifty way, within (by minimizing displacements) or across\nspaces (by being isometric). This principle, while intuitive, is often\nconfronted with several practical challenges that require adapting the OT\ntoolbox: cost functions other than the squared-Euclidean cost can be\nchallenging to handle, the deterministic formulation of Monge maps leaves\nlittle flexibility, mapping across incomparable spaces raises multiple\nchallenges, while the mass conservation constraint inherent to OT can provide\ntoo much credit to outliers. While each of these mismatches between practice\nand theory has been addressed independently in various works, we propose in\nthis work an elegant framework to unify them, called \\textit{generative\nentropic neural optimal transport} (GENOT). GENOT can accommodate any cost\nfunction; handles randomness using conditional generative models; can map\npoints across incomparable spaces, and can be used as an \\textit{unbalanced}\nsolver. We evaluate our approach through experiments conducted on various\nsynthetic datasets and demonstrate its practicality in single-cell biology. 
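The QUIK entry above keeps most weights and activations in 4-bit while retaining a few outlier channels in higher precision. The NumPy sketch below illustrates that hybrid idea on a weight matrix: symmetric per-column INT4 quantization with the largest-magnitude columns kept in FP16. The outlier-selection rule and column granularity are simplifying assumptions, not the paper's exact scheme or its GPU kernels.

```python
import numpy as np

def hybrid_int4_quantize(w, num_outlier_cols=4):
    """Quantize most columns of w to symmetric INT4; keep outlier columns in FP16."""
    col_norms = np.abs(w).max(axis=0)
    outlier_idx = np.argsort(col_norms)[-num_outlier_cols:]      # largest-magnitude columns
    base_idx = np.setdiff1d(np.arange(w.shape[1]), outlier_idx)

    base = w[:, base_idx]
    scale = np.abs(base).max(axis=0, keepdims=True) / 7.0        # INT4 range [-8, 7]
    q = np.clip(np.round(base / scale), -8, 7).astype(np.int8)

    return {"q": q, "scale": scale, "base_idx": base_idx,
            "outliers": w[:, outlier_idx].astype(np.float16), "outlier_idx": outlier_idx}

def dequantize(packed, shape):
    w_hat = np.empty(shape, dtype=np.float32)
    w_hat[:, packed["base_idx"]] = packed["q"].astype(np.float32) * packed["scale"]
    w_hat[:, packed["outlier_idx"]] = packed["outliers"].astype(np.float32)
    return w_hat

w = np.random.randn(64, 32).astype(np.float32)
packed = hybrid_int4_quantize(w)
print(np.abs(w - dequantize(packed, w.shape)).max())   # worst-case reconstruction error
```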
In\nthis domain, GENOT proves to be valuable for tasks such as modeling cell\ndevelopment, predicting cellular responses to drugs, and translating between\ndifferent data modalities of cells.\n","authors":["Dominik Klein","Théo Uscidda","Fabian Theis","Marco Cuturi"],"pdf_url":"https://arxiv.org/pdf/2310.09254v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09250v1","updated":"2023-10-13T17:06:34Z","published":"2023-10-13T17:06:34Z","title":"It's an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep\n Models","summary":" Classical wisdom in machine learning holds that the generalization error can\nbe decomposed into bias and variance, and these two terms exhibit a\n\\emph{trade-off}. However, in this paper, we show that for an ensemble of deep\nlearning based classification models, bias and variance are \\emph{aligned} at a\nsample level, where squared bias is approximately \\emph{equal} to variance for\ncorrectly classified sample points. We present empirical evidence confirming\nthis phenomenon in a variety of deep learning models and datasets. Moreover, we\nstudy this phenomenon from two theoretical perspectives: calibration and neural\ncollapse. We first show theoretically that under the assumption that the models\nare well calibrated, we can observe the bias-variance alignment. Second,\nstarting from the picture provided by the neural collapse theory, we show an\napproximate correlation between bias and variance.\n","authors":["Lin Chen","Michal Lukasik","Wittawat Jitkrittum","Chong You","Sanjiv Kumar"],"pdf_url":"https://arxiv.org/pdf/2310.09250v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09247v1","updated":"2023-10-13T16:53:25Z","published":"2023-10-13T16:53:25Z","title":"Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet\n Hierarchy","summary":" Text-to-image synthesis has recently attracted widespread attention due to\nrapidly improving quality and numerous practical applications. However, the\nlanguage understanding capabilities of text-to-image models are still poorly\nunderstood, which makes it difficult to reason about prompt formulations that a\ngiven model would understand well. In this work, we measure the capability of\npopular text-to-image models to understand $\\textit{hypernymy}$, or the \"is-a\"\nrelation between words. We design two automatic metrics based on the WordNet\nsemantic hierarchy and existing image classifiers pretrained on ImageNet. These\nmetrics both enable broad quantitative comparison of linguistic capabilities\nfor text-to-image models and offer a way of finding fine-grained qualitative\ndifferences, such as words that are unknown to models and thus are difficult\nfor them to draw. We comprehensively evaluate popular text-to-image models,\nincluding GLIDE, Latent Diffusion, and Stable Diffusion, showing how our\nmetrics can provide a better understanding of the individual strengths and\nweaknesses of these models.\n","authors":["Anton Baryshnikov","Max Ryabinin"],"pdf_url":"https://arxiv.org/pdf/2310.09247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09236v1","updated":"2023-10-13T16:40:29Z","published":"2023-10-13T16:40:29Z","title":"Time CNN and Graph Convolution Network for Epileptic Spike Detection in\n MEG Data","summary":" Magnetoencephalography (MEG) recordings of patients with epilepsy exhibit\nspikes, a typical biomarker of the pathology. Detecting those spikes allows\naccurate localization of brain regions triggering seizures. Spike detection is\noften performed manually. 
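The bias-variance alignment entry above compares squared bias and variance at the level of individual samples across an ensemble. The NumPy sketch below shows how those per-sample quantities can be estimated from ensemble probability predictions; the random predictions stand in for real model outputs, and the paper's exact decomposition may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 10, 5, 3

# Stand-ins: class probabilities from each ensemble member and one-hot labels.
logits = rng.normal(size=(n_models, n_samples, n_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
labels = np.eye(n_classes)[rng.integers(0, n_classes, size=n_samples)]

mean_pred = probs.mean(axis=0)                                    # average over models
sq_bias = ((mean_pred - labels) ** 2).sum(axis=-1)                # per-sample squared bias
variance = ((probs - mean_pred) ** 2).sum(axis=-1).mean(axis=0)   # per-sample variance

for i in range(n_samples):
    print(f"sample {i}: bias^2={sq_bias[i]:.3f}  variance={variance[i]:.3f}")
```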
However, it is a burdensome and error prone task due\nto the complexity of MEG data. To address this problem, we propose a 1D\ntemporal convolutional neural network (Time CNN) coupled with a graph\nconvolutional network (GCN) to classify short time frames of MEG recording as\ncontaining a spike or not. Compared to other recent approaches, our models have\nfewer parameters to train and we propose to use a GCN to account for MEG\nsensors spatial relationships. Our models produce clinically relevant results\nand outperform deep learning-based state-of-the-art methods reaching a\nclassification f1-score of 76.7% on a balanced dataset and of 25.5% on a\nrealistic, highly imbalanced dataset, for the spike class.\n","authors":["Pauline Mouches","Thibaut Dejean","Julien Jung","Romain Bouet","Carole Lartizien","Romain Quentin"],"pdf_url":"https://arxiv.org/pdf/2310.09236v1.pdf","comment":"This work has been submitted to IEEE ISBI 2024 for possible\n publication"},{"id":"http://arxiv.org/abs/2310.09229v1","updated":"2023-10-13T16:31:51Z","published":"2023-10-13T16:31:51Z","title":"Insuring Smiles: Predicting routine dental coverage using Spark ML","summary":" Finding suitable health insurance coverage can be challenging for individuals\nand small enterprises in the USA. The Health Insurance Exchange Public Use\nFiles (Exchange PUFs) dataset provided by CMS offers valuable information on\nhealth and dental policies [1]. In this paper, we leverage machine learning\nalgorithms to predict if a health insurance plan covers routine dental services\nfor adults. By analyzing plan type, region, deductibles, out-of-pocket\nmaximums, and copayments, we employ Logistic Regression, Decision Tree, Random\nForest, Gradient Boost, Factorization Model and Support Vector Machine\nalgorithms. Our goal is to provide a clinical strategy for individuals and\nfamilies to select the most suitable insurance plan based on income and\nexpenses.\n","authors":["Aishwarya Gupta","Rahul S. Bhogale","Priyanka Thota","Prathushkumar Dathuri","Jongwook Woo"],"pdf_url":"https://arxiv.org/pdf/2310.09229v1.pdf","comment":"4 pages, 13 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.09222v1","updated":"2023-10-13T16:20:20Z","published":"2023-10-13T16:20:20Z","title":"Fast & Efficient Learning of Bayesian Networks from Data: Knowledge\n Discovery and Causality","summary":" Structure learning is essential for Bayesian networks (BNs) as it uncovers\ncausal relationships, and enables knowledge discovery, predictions, inferences,\nand decision-making under uncertainty. Two novel algorithms, FSBN and SSBN,\nbased on the PC algorithm, employ local search strategy and conditional\nindependence tests to learn the causal network structure from data. They\nincorporate d-separation to infer additional topology information, prioritize\nconditioning sets, and terminate the search immediately and efficiently. FSBN\nachieves up to 52% computation cost reduction, while SSBN surpasses it with a\nremarkable 72% reduction for a 200-node network. SSBN demonstrates further\nefficiency gains due to its intelligent strategy. Experimental studies show\nthat both algorithms match the induction quality of the PC algorithm while\nsignificantly reducing computation costs. 
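The MEG spike-detection entry above classifies short time frames with a 1D temporal CNN, coupled in the paper with a GCN over sensor locations. Below is a minimal sketch of the temporal branch only; the sensor count, kernel sizes, and window length are placeholder choices, and the graph-convolution part is omitted.

```python
import torch
import torch.nn as nn

class TimeCNN(nn.Module):
    """Tiny 1D CNN that labels a (sensors x time) window as spike / no spike."""

    def __init__(self, n_sensors=274, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_sensors, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):                  # x: (batch, sensors, time)
        return self.classifier(self.features(x).squeeze(-1))

model = TimeCNN()
window = torch.randn(4, 274, 200)          # four windows of 200 time points
print(model(window).shape)                 # torch.Size([4, 2])
```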
This enables them to offer\ninterpretability and adaptability while reducing the computational burden,\nmaking them valuable for various applications in big data analytics.\n","authors":["Minn Sein","Fu Shunkai"],"pdf_url":"https://arxiv.org/pdf/2310.09222v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03026v2","updated":"2023-10-13T16:13:43Z","published":"2023-10-04T17:59:49Z","title":"LanguageMPC: Large Language Models as Decision Makers for Autonomous\n Driving","summary":" Existing learning-based autonomous driving (AD) systems face challenges in\ncomprehending high-level information, generalizing to rare events, and\nproviding interpretability. To address these problems, this work employs Large\nLanguage Models (LLMs) as a decision-making component for complex AD scenarios\nthat require human commonsense understanding. We devise cognitive pathways to\nenable comprehensive reasoning with LLMs, and develop algorithms for\ntranslating LLM decisions into actionable driving commands. Through this\napproach, LLM decisions are seamlessly integrated with low-level controllers by\nguided parameter matrix adaptation. Extensive experiments demonstrate that our\nproposed method not only consistently surpasses baseline approaches in\nsingle-vehicle tasks, but also helps handle complex driving behaviors even\nmulti-vehicle coordination, thanks to the commonsense reasoning capabilities of\nLLMs. This paper presents an initial step toward leveraging LLMs as effective\ndecision-makers for intricate AD scenarios in terms of safety, efficiency,\ngeneralizability, and interoperability. We aspire for it to serve as\ninspiration for future research in this field. Project page:\nhttps://sites.google.com/view/llm-mpc\n","authors":["Hao Sha","Yao Mu","Yuxuan Jiang","Li Chen","Chenfeng Xu","Ping Luo","Shengbo Eben Li","Masayoshi Tomizuka","Wei Zhan","Mingyu Ding"],"pdf_url":"https://arxiv.org/pdf/2310.03026v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.16506v2","updated":"2023-10-13T16:09:19Z","published":"2023-07-31T09:08:40Z","title":"Explainable Equivariant Neural Networks for Particle Physics: PELICAN","summary":" PELICAN is a novel permutation equivariant and Lorentz invariant or covariant\naggregator network designed to overcome common limitations found in\narchitectures applied to particle physics problems. Compared to many approaches\nthat use non-specialized architectures that neglect underlying physics\nprinciples and require very large numbers of parameters, PELICAN employs a\nfundamentally symmetry group-based architecture that demonstrates benefits in\nterms of reduced complexity, increased interpretability, and raw performance.\nWe present a comprehensive study of the PELICAN algorithm architecture in the\ncontext of both tagging (classification) and reconstructing (regression)\nLorentz-boosted top quarks, including the difficult task of specifically\nidentifying and measuring the $W$-boson inside the dense environment of the\nLorentz-boosted top-quark hadronic final state. We also extend the application\nof PELICAN to the tasks of identifying quark-initiated vs.~gluon-initiated\njets, and a multi-class identification across five separate target categories\nof jets. When tested on the standard task of Lorentz-boosted top-quark tagging,\nPELICAN outperforms existing competitors with much lower model complexity and\nhigh sample efficiency. On the less common and more complex task of 4-momentum\nregression, PELICAN also outperforms hand-crafted, non-machine learning\nalgorithms. 
We discuss the implications of symmetry-restricted architectures\nfor the wider field of machine learning for physics.\n","authors":["Alexander Bogatskiy","Timothy Hoffman","David W. Miller","Jan T. Offermann","Xiaoyang Liu"],"pdf_url":"https://arxiv.org/pdf/2307.16506v2.pdf","comment":"50 pages, 34 figures, 12 tables"},{"id":"http://arxiv.org/abs/2310.09213v1","updated":"2023-10-13T16:07:31Z","published":"2023-10-13T16:07:31Z","title":"Unseen Image Synthesis with Diffusion Models","summary":" While the current trend in the generative field is scaling up towards larger\nmodels and more training data for generalized domain representations, we go the\nopposite direction in this work by synthesizing unseen domain images without\nadditional training. We do so via latent sampling and geometric optimization\nusing pre-trained and frozen Denoising Diffusion Probabilistic Models (DDPMs)\non single-domain datasets. Our key observation is that DDPMs pre-trained even\njust on single-domain images are already equipped with sufficient\nrepresentation abilities to reconstruct arbitrary images from the inverted\nlatent encoding following bi-directional deterministic diffusion and denoising\ntrajectories. This motivates us to investigate the statistical and geometric\nbehaviors of the Out-Of-Distribution (OOD) samples from unseen image domains in\nthe latent spaces along the denoising chain. Notably, we theoretically and\nempirically show that the inverted OOD samples also establish Gaussians that\nare distinguishable from the original In-Domain (ID) samples in the\nintermediate latent spaces, which allows us to sample from them directly.\nGeometrical domain-specific and model-dependent information of the unseen\nsubspace (e.g., sample-wise distance and angles) is used to further optimize\nthe sampled OOD latent encodings from the estimated Gaussian prior. We conduct\nextensive analysis and experiments using pre-trained diffusion models (DDPM,\niDDPM) on different datasets (AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom),\nproving the effectiveness of this novel perspective to explore and re-think the\ndiffusion models' data synthesis generalization ability.\n","authors":["Ye Zhu","Yu Wu","Zhiwei Deng","Olga Russakovsky","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2310.09213v1.pdf","comment":"28 pages including appendices"},{"id":"http://arxiv.org/abs/2310.03684v2","updated":"2023-10-13T16:04:55Z","published":"2023-10-05T17:01:53Z","title":"SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks","summary":" Despite efforts to align large language models (LLMs) with human values,\nwidely-used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to\njailbreaking attacks, wherein an adversary fools a targeted LLM into generating\nobjectionable content. To address this vulnerability, we propose SmoothLLM, the\nfirst algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our\nfinding that adversarially-generated prompts are brittle to character-level\nchanges, our defense first randomly perturbs multiple copies of a given input\nprompt, and then aggregates the corresponding predictions to detect adversarial\ninputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to\nbelow one percentage point, avoids unnecessary conservatism, and admits\nprovable guarantees on attack mitigation. 
Moreover, our defense uses\nexponentially fewer queries than existing attacks and is compatible with any\nLLM.\n","authors":["Alexander Robey","Eric Wong","Hamed Hassani","George J. Pappas"],"pdf_url":"https://arxiv.org/pdf/2310.03684v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09210v1","updated":"2023-10-13T16:04:06Z","published":"2023-10-13T16:04:06Z","title":"Regularization-Based Methods for Ordinal Quantification","summary":" Quantification, i.e., the task of training predictors of the class prevalence\nvalues in sets of unlabeled data items, has received increased attention in\nrecent years. However, most quantification research has concentrated on\ndeveloping algorithms for binary and multiclass problems in which the classes\nare not ordered. Here, we study the ordinal case, i.e., the case in which a\ntotal order is defined on the set of n>2 classes. We give three main\ncontributions to this field. First, we create and make available two datasets\nfor ordinal quantification (OQ) research that overcome the inadequacies of the\npreviously available ones. Second, we experimentally compare the most important\nOQ algorithms proposed in the literature so far. To this end, we bring together\nalgorithms proposed by authors from very different research fields, such as\ndata mining and astrophysics, who were unaware of each others' developments.\nThird, we propose a novel class of regularized OQ algorithms, which outperforms\nexisting algorithms in our experiments. The key to this gain in performance is\nthat our regularization prevents ordinally implausible estimates, assuming that\nordinal distributions tend to be smooth in practice. We informally verify this\nassumption for several real-world applications.\n","authors":["Mirko Bunse","Alejandro Moreo","Fabrizio Sebastiani","Martin Senz"],"pdf_url":"https://arxiv.org/pdf/2310.09210v1.pdf","comment":"45 pages"},{"id":"http://arxiv.org/abs/2305.08960v2","updated":"2023-10-13T15:52:36Z","published":"2023-05-15T19:02:46Z","title":"One Forward is Enough for Neural Network Training via Likelihood Ratio\n Method","summary":" While backpropagation (BP) is the mainstream approach for gradient\ncomputation in neural network training, its heavy reliance on the chain rule of\ndifferentiation constrains the designing flexibility of network architecture\nand training pipelines. We avoid the recursive computation in BP and develop a\nunified likelihood ratio (ULR) method for gradient estimation with just one\nforward propagation. Not only can ULR be extended to train a wide variety of\nneural network architectures, but the computation flow in BP can also be\nrearranged by ULR for better device adaptation. Moreover, we propose several\nvariance reduction techniques to further accelerate the training process. Our\nexperiments offer numerical results across diverse aspects, including various\nneural network training scenarios, computation flow rearrangement, and\nfine-tuning of pre-trained models. 
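The SmoothLLM entry above perturbs several copies of a prompt at the character level and aggregates the resulting judgements. The sketch below shows that randomize-and-vote skeleton with a dummy classifier in place of the target LLM and its jailbreak check; the perturbation rate, the trigger-word stand-in, and the majority-vote rule are illustrative assumptions.

```python
import random
import string

rng = random.Random(0)

def perturb(prompt, rate=0.1):
    """Randomly replace a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < rate:
            chars[i] = rng.choice(string.ascii_letters)
    return "".join(chars)

def dummy_llm(prompt):
    # Stand-in for querying the target LLM: refuses if the trigger word survives.
    return "I cannot help with that." if "attack" in prompt else "Sure, here is ..."

def is_refusal(response):
    return "cannot help" in response

def smoothed_refusal(prompt, n_copies=10):
    """Query perturbed copies and take a majority vote over refusal decisions."""
    votes = [is_refusal(dummy_llm(perturb(prompt))) for _ in range(n_copies)]
    return sum(votes) > n_copies // 2

print(smoothed_refusal("please attack this web server"))
```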
All findings demonstrate that ULR\neffectively enhances the flexibility of neural network training by permitting\nlocalized module training without compromising the global objective and\nsignificantly boosts the network robustness.\n","authors":["Jinyang Jiang","Zeliang Zhang","Chenliang Xu","Zhaofei Yu","Yijie Peng"],"pdf_url":"https://arxiv.org/pdf/2305.08960v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09203v1","updated":"2023-10-13T15:48:24Z","published":"2023-10-13T15:48:24Z","title":"SiamAF: Learning Shared Information from ECG and PPG Signals for Robust\n Atrial Fibrillation Detection","summary":" Atrial fibrillation (AF) is the most common type of cardiac arrhythmia. It is\nassociated with an increased risk of stroke, heart failure, and other\ncardiovascular complications, but can be clinically silent. Passive AF\nmonitoring with wearables may help reduce adverse clinical outcomes related to\nAF. Detecting AF in noisy wearable data poses a significant challenge, leading\nto the emergence of various deep learning techniques. Previous deep learning\nmodels learn from a single modality, either electrocardiogram (ECG) or\nphotoplethysmography (PPG) signals. However, deep learning models often\nstruggle to learn generalizable features and rely on features that are more\nsusceptible to corruption from noise, leading to sub-optimal performances in\ncertain scenarios, especially with low-quality signals. Given the increasing\navailability of ECG and PPG signal pairs from wearables and bedside monitors,\nwe propose a new approach, SiamAF, leveraging a novel Siamese network\narchitecture and joint learning loss function to learn shared information from\nboth ECG and PPG signals. At inference time, the proposed model is able to\npredict AF from either PPG or ECG and outperforms baseline methods on three\nexternal test sets. It learns medically relevant features as a result of our\nnovel architecture design. The proposed model also achieves comparable\nperformance to traditional learning regimes while requiring much fewer training\nlabels, providing a potential approach to reduce future reliance on manual\nlabeling.\n","authors":["Zhicheng Guo","Cheng Ding","Duc H. Do","Amit Shah","Randall J. Lee","Xiao Hu","Cynthia Rudin"],"pdf_url":"https://arxiv.org/pdf/2310.09203v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09202v1","updated":"2023-10-13T15:48:12Z","published":"2023-10-13T15:48:12Z","title":"Graph Condensation via Eigenbasis Matching","summary":" The increasing amount of graph data places requirements on the efficiency and\nscalability of graph neural networks (GNNs), despite their effectiveness in\nvarious graph-related applications. Recently, the emerging graph condensation\n(GC) sheds light on reducing the computational cost of GNNs from a data\nperspective. It aims to replace the real large graph with a significantly\nsmaller synthetic graph so that GNNs trained on both graphs exhibit comparable\nperformance. However, our empirical investigation reveals that existing GC\nmethods suffer from poor generalization, i.e., different GNNs trained on the\nsame synthetic graph have obvious performance gaps. What factors hinder the\ngeneralization of GC and how can we mitigate it? To answer this question, we\ncommence with a detailed analysis and observe that GNNs will inject spectrum\nbias into the synthetic graph, resulting in a distribution shift. 
To tackle\nthis issue, we propose eigenbasis matching for spectrum-free graph\ncondensation, named GCEM, which has two key steps: First, GCEM matches the\neigenbasis of the real and synthetic graphs, rather than the graph structure,\nwhich eliminates the spectrum bias of GNNs. Subsequently, GCEM leverages the\nspectrum of the real graph and the synthetic eigenbasis to construct the\nsynthetic graph, thereby preserving the essential structural information. We\ntheoretically demonstrate that the synthetic graph generated by GCEM maintains\nthe spectral similarity, i.e., total variation, of the real graph. Extensive\nexperiments conducted on five graph datasets verify that GCEM not only achieves\nstate-of-the-art performance over baselines but also significantly narrows the\nperformance gaps between different GNNs.\n","authors":["Yang Liu","Deyu Bo","Chuan Shi"],"pdf_url":"https://arxiv.org/pdf/2310.09202v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2310.09196v1","updated":"2023-10-13T15:42:55Z","published":"2023-10-13T15:42:55Z","title":"A 4-approximation algorithm for min max correlation clustering","summary":" We introduce a lower bounding technique for the min max correlation\nclustering problem and, based on this technique, a combinatorial\n4-approximation algorithm for complete graphs. This improves upon the previous\nbest known approximation guarantees of 5, using a linear program formulation\n(Kalhan et al., 2019), and 4, for a combinatorial algorithm (Davies et al.,\n2023). We extend this algorithm by a greedy joining heuristic and show\nempirically that it improves the state of the art in solution quality and\nruntime on several benchmark datasets.\n","authors":["Holger Heidrich","Jannik Irmai","Bjoern Andres"],"pdf_url":"https://arxiv.org/pdf/2310.09196v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2310.09194v1","updated":"2023-10-13T15:40:55Z","published":"2023-10-13T15:40:55Z","title":"Variational autoencoder with weighted samples for high-dimensional\n non-parametric adaptive importance sampling","summary":" Probability density function estimation with weighted samples is the main\nfoundation of all adaptive importance sampling algorithms. Classically, a\ntarget distribution is approximated either by a non-parametric model or within\na parametric family. However, these models suffer from the curse of\ndimensionality or from their lack of flexibility. In this contribution, we\nsuggest to use as the approximating model a distribution parameterised by a\nvariational autoencoder. We extend the existing framework to the case of\nweighted samples by introducing a new objective function. The flexibility of\nthe obtained family of distributions makes it as expressive as a non-parametric\nmodel, and despite the very high number of parameters to estimate, this family\nis much more efficient in high dimension than the classical Gaussian or\nGaussian mixture families. Moreover, in order to add flexibility to the model\nand to be able to learn multimodal distributions, we consider a learnable prior\ndistribution for the variational autoencoder latent variables. We also\nintroduce a new pre-training procedure for the variational autoencoder to find\ngood starting weights of the neural networks to prevent as much as possible the\nposterior collapse phenomenon to happen. 
Finally, we make explicit how the resulting\ndistribution can be combined with importance sampling, and we exploit the\nproposed procedure in existing adaptive importance sampling algorithms to draw\npoints from a target distribution and to estimate a rare event probability in\nhigh dimension on two multimodal problems.\n","authors":["Julien Demange-Chryst","François Bachoc","Jérôme Morio","Timothé Krauth"],"pdf_url":"https://arxiv.org/pdf/2310.09194v1.pdf","comment":"20 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.09192v1","updated":"2023-10-13T15:36:48Z","published":"2023-10-13T15:36:48Z","title":"Does Graph Distillation See Like Vision Dataset Counterpart?","summary":" Training on large-scale graphs has achieved remarkable results in graph\nrepresentation learning, but its cost and storage have attracted increasing\nconcerns. Existing graph condensation methods primarily focus on optimizing the\nfeature matrices of condensed graphs while overlooking the impact of the\nstructure information from the original graphs. To investigate the impact of\nthe structure information, we conduct analysis from the spectral domain and\nempirically identify substantial Laplacian Energy Distribution (LED) shifts in\nprevious works. Such shifts lead to poor performance in cross-architecture\ngeneralization and specific tasks, including anomaly detection and link\nprediction. In this paper, we propose a novel Structure-broadcasting Graph\nDataset Distillation (SGDD) scheme for broadcasting the original structure\ninformation to the generation of the synthetic one, which explicitly prevents\noverlooking the original structure information. Theoretically, the synthetic\ngraphs by SGDD are expected to have smaller LED shifts than previous works,\nleading to superior performance in both cross-architecture settings and\nspecific tasks. We validate the proposed SGDD across 9 datasets and achieve\nstate-of-the-art results on all of them: for example, on the YelpChi dataset,\nour approach maintains 98.6% of the test accuracy of training on the original\ngraph dataset with 1,000 times saving on the scale of the graph. Moreover, we\nempirically observe 17.6% ~ 31.4% reductions in LED shift across the\n9 datasets. Extensive experiments and analysis verify the effectiveness and\nnecessity of the proposed designs. The code is available in the GitHub\nrepository: https://github.com/RingBDStack/SGDD.\n","authors":["Beining Yang","Kai Wang","Qingyun Sun","Cheng Ji","Xingcheng Fu","Hao Tang","Yang You","Jianxin Li"],"pdf_url":"https://arxiv.org/pdf/2310.09192v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.17216v3","updated":"2023-10-13T15:35:42Z","published":"2023-05-26T19:22:03Z","title":"Generating Images with Multimodal Language Models","summary":" We propose a method to fuse frozen text-only large language models (LLMs)\nwith pre-trained image encoder and decoder models, by mapping between their\nembedding spaces. Our model demonstrates a wide suite of multimodal\ncapabilities: image retrieval, novel image generation, and multimodal dialogue.\nOurs is the first approach capable of conditioning on arbitrarily interleaved\nimage and text inputs to generate coherent image (and text) outputs. To achieve\nstrong performance on image generation, we propose an efficient mapping network\nto ground the LLM to an off-the-shelf text-to-image generation model. 
This\nmapping network translates hidden representations of text into the embedding\nspace of the visual models, enabling us to leverage the strong text\nrepresentations of the LLM for visual outputs. Our approach outperforms\nbaseline generation models on tasks with longer and more complex language. In\naddition to novel image generation, our model is also capable of image\nretrieval from a prespecified dataset, and decides whether to retrieve or\ngenerate at inference time. This is done with a learnt decision module which\nconditions on the hidden representations of the LLM. Our model exhibits a wider\nrange of capabilities compared to prior multimodal language models. It can\nprocess image-and-text inputs, and produce retrieved images, generated images,\nand generated text -- outperforming non-LLM based generation models across\nseveral text-to-image tasks that measure context dependence.\n","authors":["Jing Yu Koh","Daniel Fried","Ruslan Salakhutdinov"],"pdf_url":"https://arxiv.org/pdf/2305.17216v3.pdf","comment":"NeurIPS 2023. Project page: http://jykoh.com/gill"},{"id":"http://arxiv.org/abs/2306.03655v2","updated":"2023-10-13T15:33:48Z","published":"2023-06-06T13:15:01Z","title":"Online Learning under Adversarial Nonlinear Constraints","summary":" In many applications, learning systems are required to process continuous\nnon-stationary data streams. We study this problem in an online learning\nframework and propose an algorithm that can deal with adversarial time-varying\nand nonlinear constraints. As we show in our work, the algorithm called\nConstraint Violation Velocity Projection (CVV-Pro) achieves $\\sqrt{T}$ regret\nand converges to the feasible set at a rate of $1/\\sqrt{T}$, despite the fact\nthat the feasible set is slowly time-varying and a priori unknown to the\nlearner. CVV-Pro only relies on local sparse linear approximations of the\nfeasible set and therefore avoids optimizing over the entire set at each\niteration, which is in sharp contrast to projected gradients or Frank-Wolfe\nmethods. We also empirically evaluate our algorithm on two-player games, where\nthe players are subjected to a shared constraint.\n","authors":["Pavel Kolev","Georg Martius","Michael Muehlebach"],"pdf_url":"https://arxiv.org/pdf/2306.03655v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2303.00233v3","updated":"2023-10-13T15:32:57Z","published":"2023-03-01T05:03:23Z","title":"Single-Cell Multimodal Prediction via Transformers","summary":" The recent development of multimodal single-cell technology has made the\npossibility of acquiring multiple omics data from individual cells, thereby\nenabling a deeper understanding of cellular states and dynamics. Nevertheless,\nthe proliferation of multimodal single-cell data also introduces tremendous\nchallenges in modeling the complex interactions among different modalities. The\nrecently advanced methods focus on constructing static interaction graphs and\napplying graph neural networks (GNNs) to learn from multimodal data. However,\nsuch static graphs can be suboptimal as they do not take advantage of the\ndownstream task information; meanwhile GNNs also have some inherent limitations\nwhen deeply stacking GNN layers. To tackle these issues, in this work, we\ninvestigate how to leverage transformers for multimodal single-cell data in an\nend-to-end manner while exploiting downstream task information. 
In particular,\nwe propose a scMoFormer framework which can readily incorporate external domain\nknowledge and model the interactions within each modality and cross modalities.\nExtensive experiments demonstrate that scMoFormer achieves superior performance\non various benchmark datasets. Remarkably, scMoFormer won a Kaggle silver medal\nwith the rank of 24/1221 (Top 2%) without ensemble in a NeurIPS 2022\ncompetition. Our implementation is publicly available at Github.\n","authors":["Wenzhuo Tang","Hongzhi Wen","Renming Liu","Jiayuan Ding","Wei Jin","Yuying Xie","Hui Liu","Jiliang Tang"],"pdf_url":"https://arxiv.org/pdf/2303.00233v3.pdf","comment":"CIKM 2023"},{"id":"http://arxiv.org/abs/2207.10062v4","updated":"2023-10-13T15:24:24Z","published":"2022-07-20T17:47:54Z","title":"DataPerf: Benchmarks for Data-Centric AI Development","summary":" Machine learning research has long focused on models rather than datasets,\nand prominent datasets are used for common ML tasks without regard to the\nbreadth, difficulty, and faithfulness of the underlying problems. Neglecting\nthe fundamental importance of data has given rise to inaccuracy, bias, and\nfragility in real-world applications, and research is hindered by saturation\nacross existing dataset benchmarks. In response, we present DataPerf, a\ncommunity-led benchmark suite for evaluating ML datasets and data-centric\nalgorithms. We aim to foster innovation in data-centric AI through competition,\ncomparability, and reproducibility. We enable the ML community to iterate on\ndatasets, instead of just architectures, and we provide an open, online\nplatform with multiple rounds of challenges to support this iterative\ndevelopment. The first iteration of DataPerf contains five benchmarks covering\na wide spectrum of data-centric techniques, tasks, and modalities in vision,\nspeech, acquisition, debugging, and diffusion prompting, and we support hosting\nnew contributed benchmarks from the community. The benchmarks, online\nevaluation platform, and baseline implementations are open source, and the\nMLCommons Association will maintain DataPerf to ensure long-term benefits to\nacademia and industry.\n","authors":["Mark Mazumder","Colby Banbury","Xiaozhe Yao","Bojan Karlaš","William Gaviria Rojas","Sudnya Diamos","Greg Diamos","Lynn He","Alicia Parrish","Hannah Rose Kirk","Jessica Quaye","Charvi Rastogi","Douwe Kiela","David Jurado","David Kanter","Rafael Mosquera","Juan Ciro","Lora Aroyo","Bilge Acun","Lingjiao Chen","Mehul Smriti Raje","Max Bartolo","Sabri Eyuboglu","Amirata Ghorbani","Emmett Goodman","Oana Inel","Tariq Kane","Christine R. Kirkpatrick","Tzu-Sheng Kuo","Jonas Mueller","Tristan Thrush","Joaquin Vanschoren","Margaret Warren","Adina Williams","Serena Yeung","Newsha Ardalani","Praveen Paritosh","Lilith Bat-Leah","Ce Zhang","James Zou","Carole-Jean Wu","Cody Coleman","Andrew Ng","Peter Mattson","Vijay Janapa Reddi"],"pdf_url":"https://arxiv.org/pdf/2207.10062v4.pdf","comment":"NeurIPS 2023 Datasets and Benchmarks Track"},{"id":"http://arxiv.org/abs/2310.09183v1","updated":"2023-10-13T15:21:25Z","published":"2023-10-13T15:21:25Z","title":"PRIOR: Personalized Prior for Reactivating the Information Overlooked in\n Federated Learning","summary":" Classical federated learning (FL) enables training machine learning models\nwithout sharing data for privacy preservation, but heterogeneous data\ncharacteristic degrades the performance of the localized model. 
Personalized FL\n(PFL) addresses this by synthesizing personalized models from a global model\nvia training on local data. Such a global model may overlook the specific\ninformation of the clients that have been sampled. In this paper, we propose a\nnovel scheme to inject personalized prior knowledge into the global model in\neach client, which attempts to mitigate the introduced incomplete information\nproblem in PFL. At the heart of our proposed approach is a framework, the PFL\nwith Bregman Divergence (pFedBreD), decoupling the personalized prior from the\nlocal objective function regularized by Bregman divergence for greater\nadaptability in personalized scenarios. We also relax the mirror descent (RMD)\nto extract the prior explicitly to provide optional strategies. Additionally,\nour pFedBreD is backed up by a convergence analysis. Extensive experiments\ndemonstrate that our method reaches state-of-the-art performance on 5\ndatasets and outperforms other methods by up to 3.5% across 8 benchmarks.\nExtensive analyses verify the robustness and necessity of the proposed designs.\n","authors":["Mingjia Shi","Yuhao Zhou","Kai Wang","Huaizheng Zhang","Shudong Huang","Qing Ye","Jiangcheng Lv"],"pdf_url":"https://arxiv.org/pdf/2310.09183v1.pdf","comment":"This paper is accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.02490v2","updated":"2023-10-13T15:16:59Z","published":"2023-08-04T17:59:47Z","title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","summary":" We propose MM-Vet, an evaluation benchmark that examines large multimodal\nmodels (LMMs) on complicated multimodal tasks. Recent LMMs have shown various\nintriguing abilities, such as solving math problems written on the blackboard,\nreasoning about events and celebrities in news images, and explaining visual\njokes. Rapid model advancements pose challenges to evaluation benchmark\ndevelopment. Problems include: (1) How to systematically structure and evaluate\nthe complicated multimodal tasks; (2) How to design evaluation metrics that\nwork well across question and answer types; and (3) How to give model insights\nbeyond a simple performance ranking. To this end, we present MM-Vet, designed\nbased on the insight that the intriguing ability to solve complicated tasks is\noften achieved by a generalist model being able to integrate different core\nvision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and\nexamines the 16 integrations of interest derived from the capability\ncombination. For evaluation metrics, we propose an LLM-based evaluator for\nopen-ended outputs. The evaluator enables the evaluation across different\nquestion types and answer styles, resulting in a unified scoring metric. We\nevaluate representative LMMs on MM-Vet, providing insights into the\ncapabilities of different LMM system paradigms and models. Code and data are\navailable at https://github.com/yuweihao/MM-Vet.\n","authors":["Weihao Yu","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Kevin Lin","Zicheng Liu","Xinchao Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.02490v2.pdf","comment":"Update results of OpenFlamingo-9B (MPT), LLaMA-Adapter v2-7B, and\n Otter-9B (MPT). 
Code, data and leaderboard:\n https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2310.09167v1","updated":"2023-10-13T15:01:55Z","published":"2023-10-13T15:01:55Z","title":"A Deep Neural Network -- Mechanistic Hybrid Model to Predict\n Pharmacokinetics in Rat","summary":" An important aspect in the development of small molecules as drugs or\nagro-chemicals is their systemic availability after intravenous and oral\nadministration. The prediction of the systemic availability from the chemical\nstructure of a potential candidate is highly desirable, as it allows focusing\nthe drug or agrochemical development on compounds with a favorable kinetic\nprofile. However, such predictions are challenging as the availability is the\nresult of the complex interplay between molecular properties, biology and\nphysiology, and training data is rare. In this work we improve the hybrid model\ndeveloped earlier [34]. We reduce the median fold change error for the total\noral exposure from 2.85 to 2.35 and for intravenous administration from 1.95 to\n1.62. This is achieved by training on a larger data set, improving the neural\nnetwork architecture as well as the parametrization of the mechanistic model.\nFurther, we extend our approach to predict additional endpoints and to handle\ndifferent covariates, like sex and dosage form. In contrast to a pure machine\nlearning model, our model is able to predict new endpoints on which it has not\nbeen trained. We demonstrate this feature by predicting the exposure over the\nfirst 24h, while the model has only been trained on the total exposure.\n","authors":["Florian Führer","Andrea Gruber","Holger Diedam","Andreas H. Göller","Stephan Menz","Sebastian Schneckener"],"pdf_url":"https://arxiv.org/pdf/2310.09167v1.pdf","comment":"Journal of Computer-Aided Molecular Design"},{"id":"http://arxiv.org/abs/2310.09162v1","updated":"2023-10-13T14:56:38Z","published":"2023-10-13T14:56:38Z","title":"Quantum Machine Learning in Climate Change and Sustainability: a Review","summary":" Climate change and its impact on global sustainability are critical\nchallenges, demanding innovative solutions that combine cutting-edge\ntechnologies and scientific insights. Quantum machine learning (QML) has\nemerged as a promising paradigm that harnesses the power of quantum computing\nto address complex problems in various domains including climate change and\nsustainability. In this work, we survey existing literature that applies\nquantum machine learning to solve climate change and sustainability-related\nproblems. We review promising QML methodologies that have the potential to\naccelerate decarbonization including energy systems, climate data forecasting,\nclimate monitoring, and hazardous events predictions. We discuss the challenges\nand current limitations of quantum machine learning approaches and provide an\noverview of potential opportunities and future work to leverage QML-based\nmethods in the important area of climate change research.\n","authors":["Amal Nammouchi","Andreas Kassler","Andreas Theorachis"],"pdf_url":"https://arxiv.org/pdf/2310.09162v1.pdf","comment":"8 pages Accepted for publication in AAAI proceedings (AAAI Fall\n symposium 2023)"},{"id":"http://arxiv.org/abs/2310.09163v1","updated":"2023-10-13T14:56:38Z","published":"2023-10-13T14:56:38Z","title":"Jointly-Learned Exit and Inference for a Dynamic Neural Network :\n JEI-DNN","summary":" Large pretrained models, coupled with fine-tuning, are slowly becoming\nestablished as the dominant architecture in machine learning. 
Even though these\nmodels offer impressive performance, their practical application is often\nlimited by the prohibitive amount of resources required for every inference.\nEarly-exiting dynamic neural networks (EDNN) circumvent this issue by allowing\na model to make some of its predictions from intermediate layers (i.e.,\nearly-exit). Training an EDNN architecture is challenging as it consists of two\nintertwined components: the gating mechanism (GM) that controls early-exiting\ndecisions and the intermediate inference modules (IMs) that perform inference\nfrom intermediate representations. As a result, most existing approaches rely\non thresholding confidence metrics for the gating mechanism and strive to\nimprove the underlying backbone network and the inference modules. Although\nsuccessful, this approach has two fundamental shortcomings: 1) the GMs and the\nIMs are decoupled during training, leading to a train-test mismatch; and 2) the\nthresholding gating mechanism introduces a positive bias into the predictive\nprobabilities, making it difficult to readily extract uncertainty information.\nWe propose a novel architecture that connects these two modules. This leads to\nsignificant performance improvements on classification datasets and enables\nbetter uncertainty characterization capabilities.\n","authors":["Florence Regol","Joud Chataoui","Mark Coates"],"pdf_url":"https://arxiv.org/pdf/2310.09163v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09157v1","updated":"2023-10-13T14:52:46Z","published":"2023-10-13T14:52:46Z","title":"The Computational Complexity of Finding Stationary Points in Non-Convex\n Optimization","summary":" Finding approximate stationary points, i.e., points where the gradient is\napproximately zero, of non-convex but smooth objective functions $f$ over\nunrestricted $d$-dimensional domains is one of the most fundamental problems in\nclassical non-convex optimization. Nevertheless, the computational and query\ncomplexity of this problem are still not well understood when the dimension $d$\nof the problem is independent of the approximation error. In this paper, we\nshow the following computational and query complexity results:\n 1. The problem of finding approximate stationary points over unrestricted\ndomains is PLS-complete.\n 2. For $d = 2$, we provide a zero-order algorithm for finding\n$\\varepsilon$-approximate stationary points that requires at most\n$O(1/\\varepsilon)$ value queries to the objective function.\n 3. We show that any algorithm needs at least $\\Omega(1/\\varepsilon)$ queries\nto the objective function and/or its gradient to find $\\varepsilon$-approximate\nstationary points when $d=2$. Combined with the above, this characterizes the\nquery complexity of this problem to be $\\Theta(1/\\varepsilon)$.\n 4. For $d = 2$, we provide a zero-order algorithm for finding\n$\\varepsilon$-KKT points in constrained optimization problems that requires at\nmost $O(1/\\sqrt{\\varepsilon})$ value queries to the objective function. This\ncloses the gap between the works of Bubeck and Mikulincer [2020] and Vavasis\n[1993] and characterizes the query complexity of this problem to be\n$\\Theta(1/\\sqrt{\\varepsilon})$.\n 5. Combining our results with the recent result of Fearnley et al. 
[2022], we\nshow that finding approximate KKT points in constrained optimization is\nreducible to finding approximate stationary points in unconstrained\noptimization but the converse is impossible.\n","authors":["Alexandros Hollender","Manolis Zampetakis"],"pdf_url":"https://arxiv.org/pdf/2310.09157v1.pdf","comment":"Full version of COLT 2023 extended abstract"},{"id":"http://arxiv.org/abs/2310.09149v1","updated":"2023-10-13T14:43:11Z","published":"2023-10-13T14:43:11Z","title":"Lattice Approximations in Wasserstein Space","summary":" We consider structured approximation of measures in Wasserstein space\n$W_p(\\mathbb{R}^d)$ for $p\\in[1,\\infty)$ by discrete and piecewise constant\nmeasures based on a scaled Voronoi partition of $\\mathbb{R}^d$. We show that if\na full rank lattice $\\Lambda$ is scaled by a factor of $h\\in(0,1]$, then\napproximation of a measure based on the Voronoi partition of $h\\Lambda$ is\n$O(h)$ regardless of $d$ or $p$. We then use a covering argument to show that\n$N$-term approximations of compactly supported measures is $O(N^{-\\frac1d})$\nwhich matches known rates for optimal quantizers and empirical measure\napproximation in most instances. Finally, we extend these results to\nnoncompactly supported measures with sufficient decay.\n","authors":["Keaton Hamm","Varun Khurana"],"pdf_url":"https://arxiv.org/pdf/2310.09149v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09144v1","updated":"2023-10-13T14:35:59Z","published":"2023-10-13T14:35:59Z","title":"Goodhart's Law in Reinforcement Learning","summary":" Implementing a reward function that perfectly captures a complex task in the\nreal world is impractical. As a result, it is often appropriate to think of the\nreward function as a proxy for the true objective rather than as its\ndefinition. We study this phenomenon through the lens of Goodhart's law, which\npredicts that increasing optimisation of an imperfect proxy beyond some\ncritical point decreases performance on the true objective. First, we propose a\nway to quantify the magnitude of this effect and show empirically that\noptimising an imperfect proxy reward often leads to the behaviour predicted by\nGoodhart's law for a wide range of environments and reward functions. We then\nprovide a geometric explanation for why Goodhart's law occurs in Markov\ndecision processes. We use these theoretical insights to propose an optimal\nearly stopping method that provably avoids the aforementioned pitfall and\nderive theoretical regret bounds for this method. Moreover, we derive a\ntraining method that maximises worst-case reward, for the setting where there\nis uncertainty about the true reward function. Finally, we evaluate our early\nstopping method experimentally. Our results support a foundation for a\ntheoretically-principled study of reinforcement learning under reward\nmisspecification.\n","authors":["Jacek Karwowski","Oliver Hayman","Xingjian Bai","Klaus Kiendlhofer","Charlie Griffin","Joar Skalse"],"pdf_url":"https://arxiv.org/pdf/2310.09144v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09139v1","updated":"2023-10-13T14:27:21Z","published":"2023-10-13T14:27:21Z","title":"The Consensus Game: Language Model Generation via Equilibrium Search","summary":" When applied to question answering and other text generation tasks, language\nmodels (LMs) may be queried generatively (by sampling answers from their output\ndistribution) or discriminatively (by using them to score or rank a set of\ncandidate outputs). 
These procedures sometimes yield very different\npredictions. How do we reconcile mutually incompatible scoring procedures to\nobtain coherent LM predictions? We introduce a new, training-free,\ngame-theoretic procedure for language model decoding. Our approach casts\nlanguage model decoding as a regularized imperfect-information sequential\nsignaling game - which we term the CONSENSUS GAME - in which a GENERATOR seeks\nto communicate an abstract correctness parameter using natural language\nsentences to a DISCRIMINATOR. We develop computational procedures for finding\napproximate equilibria of this game, resulting in a decoding algorithm we call\nEQUILIBRIUM-RANKING. Applied to a large number of tasks (including reading\ncomprehension, commonsense reasoning, mathematical problem-solving, and\ndialog), EQUILIBRIUM-RANKING consistently, and sometimes substantially,\nimproves performance over existing LM decoding procedures - on multiple\nbenchmarks, we observe that applying EQUILIBRIUM-RANKING to LLaMA-7B\noutperforms the much larger LLaMA-65B and PaLM-540B models. These results\nhighlight the promise of game-theoretic tools for addressing fundamental\nchallenges of truthfulness and consistency in LMs.\n","authors":["Athul Paul Jacob","Yikang Shen","Gabriele Farina","Jacob Andreas"],"pdf_url":"https://arxiv.org/pdf/2310.09139v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07321v2","updated":"2023-10-13T14:24:31Z","published":"2023-10-11T09:09:55Z","title":"On the Impact of Cross-Domain Data on German Language Models","summary":" Traditionally, large language models have been either trained on general web\ncrawls or domain-specific data. However, recent successes of generative large\nlanguage models have shed light on the benefits of cross-domain datasets. To\nexamine the significance of prioritizing data diversity over quality, we\npresent a German dataset comprising texts from five domains, along with another\ndataset aimed at containing high-quality data. Through training a series of\nmodels ranging between 122M and 750M parameters on both datasets, we conduct a\ncomprehensive benchmark on multiple downstream tasks. Our findings demonstrate\nthat the models trained on the cross-domain dataset outperform those trained on\nquality data alone, leading to improvements of up to $4.45\\%$ over the previous\nstate-of-the-art. The models are available at\nhttps://huggingface.co/ikim-uk-essen\n","authors":["Amin Dada","Aokun Chen","Cheng Peng","Kaleb E Smith","Ahmad Idrissi-Yaghir","Constantin Marc Seibold","Jianning Li","Lars Heiliger","Xi Yang","Christoph M. Friedrich","Daniel Truhn","Jan Egger","Jiang Bian","Jens Kleesiek","Yonghui Wu"],"pdf_url":"https://arxiv.org/pdf/2310.07321v2.pdf","comment":"13 pages, 1 figure, accepted at Findings of the Association for\n Computational Linguistics: EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.09129v1","updated":"2023-10-13T14:17:25Z","published":"2023-10-13T14:17:25Z","title":"Computing Marginal and Conditional Divergences between Decomposable\n Models with Applications","summary":" The ability to compute the exact divergence between two high-dimensional\ndistributions is useful in many applications but doing so naively is\nintractable. Computing the alpha-beta divergence -- a family of divergences\nthat includes the Kullback-Leibler divergence and Hellinger distance -- between\nthe joint distribution of two decomposable models, i.e., chordal Markov\nnetworks, can be done in time exponential in the treewidth of these models. 
However,\nreducing the dissimilarity between two high-dimensional objects to a single\nscalar value can be uninformative. Furthermore, in applications such as\nsupervised learning, the divergence over a conditional distribution might be of\nmore interest. Therefore, we propose an approach to compute the exact\nalpha-beta divergence between any marginal or conditional distribution of two\ndecomposable models. Doing so tractably is non-trivial as we need to decompose\nthe divergence between these distributions and therefore, require a\ndecomposition over the marginal and conditional distributions of these models.\nConsequently, we provide such a decomposition and also extend existing work to\ncompute the marginal and conditional alpha-beta divergence between these\ndecompositions. We then show how our method can be used to analyze\ndistributional changes by first applying it to a benchmark image dataset.\nFinally, based on our framework, we propose a novel way to quantify the error\nin contemporary superconducting quantum computers. Code for all experiments is\navailable at: https://lklee.dev/pub/2023-icdm/code\n","authors":["Loong Kuan Lee","Geoffrey I. Webb","Daniel F. Schmidt","Nico Piatkowski"],"pdf_url":"https://arxiv.org/pdf/2310.09129v1.pdf","comment":"10 pages, 8 figures, Accepted at the IEEE International Conference on\n Data Mining (ICDM) 2023"},{"id":"http://arxiv.org/abs/2310.09127v1","updated":"2023-10-13T14:15:54Z","published":"2023-10-13T14:15:54Z","title":"On Generalization Bounds for Projective Clustering","summary":" Given a set of points, clustering consists of finding a partition of a point\nset into $k$ clusters such that the center to which a point is assigned is as\nclose as possible. Most commonly, centers are points themselves, which leads to\nthe famous $k$-median and $k$-means objectives. One may also choose centers to\nbe $j$ dimensional subspaces, which gives rise to subspace clustering. In this\npaper, we consider learning bounds for these problems. That is, given a set of\n$n$ samples $P$ drawn independently from some unknown, but fixed distribution\n$\\mathcal{D}$, how quickly does a solution computed on $P$ converge to the\noptimal clustering of $\\mathcal{D}$? We give several near optimal results. In\nparticular,\n For center-based objectives, we show a convergence rate of\n$\\tilde{O}\\left(\\sqrt{{k}/{n}}\\right)$. This matches the known optimal bounds\nof [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016]\nand [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means\nand extends it to other important objectives such as $k$-median.\n For subspace clustering with $j$-dimensional subspaces, we show a convergence\nrate of $\\tilde{O}\\left(\\sqrt{\\frac{kj^2}{n}}\\right)$. These are the first\nprovable bounds for most of these problems. 
For the specific case of projective\nclustering, which generalizes $k$-means, we show a convergence rate of\n$\\Omega\\left(\\sqrt{\\frac{kj}{n}}\\right)$ is necessary, thereby proving that the\nbounds from [Fefferman, Mitter, and Narayanan, Journal of the Mathematical\nSociety 2016] are essentially optimal.\n","authors":["Maria Sofia Bucarelli","Matilde Fjeldsø Larsen","Chris Schwiegelshohn","Mads Bech Toftrup"],"pdf_url":"https://arxiv.org/pdf/2310.09127v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09126v1","updated":"2023-10-13T14:14:43Z","published":"2023-10-13T14:14:43Z","title":"Physics-guided Noise Neural Proxy for Low-light Raw Image Denoising","summary":" Low-light raw image denoising plays a crucial role in mobile photography, and\nlearning-based methods have become the mainstream approach. Training the\nlearning-based methods with synthetic data emerges as an efficient and\npractical alternative to paired real data. However, the quality of synthetic\ndata is inherently limited by the low accuracy of the noise model, which\ndecreases the performance of low-light raw image denoising. In this paper, we\ndevelop a novel framework for accurate noise modeling that learns a\nphysics-guided noise neural proxy (PNNP) from dark frames. PNNP integrates\nthree efficient techniques: physics-guided noise decoupling (PND),\nphysics-guided proxy model (PPM), and differentiable distribution-oriented loss\n(DDL). The PND decouples the dark frame into different components and handles\ndifferent levels of noise in a flexible manner, which reduces the complexity of\nthe noise neural proxy. The PPM incorporates physical priors to effectively\nconstrain the generated noise, which promotes the accuracy of the noise neural\nproxy. The DDL provides explicit and reliable supervision for noise modeling,\nwhich promotes the precision of the noise neural proxy. Extensive experiments\non public low-light raw image denoising datasets and real low-light imaging\nscenarios demonstrate the superior performance of our PNNP framework.\n","authors":["Hansen Feng","Lizhi Wang","Yiqi Huang","Yuzhi Wang","Hua Huang"],"pdf_url":"https://arxiv.org/pdf/2310.09126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09125v1","updated":"2023-10-13T14:14:00Z","published":"2023-10-13T14:14:00Z","title":"Training and Predicting Visual Error for Real-Time Applications","summary":" Visual error metrics play a fundamental role in the quantification of\nperceived image similarity. Most recently, use cases for them in real-time\napplications have emerged, such as content-adaptive shading and shading reuse\nto increase performance and improve efficiency. A wide range of different\nmetrics has been established, with the most sophisticated being capable of\ncapturing the perceptual characteristics of the human visual system. However,\ntheir complexity, computational expense, and reliance on reference images to\ncompare against prevent their generalized use in real-time, restricting such\napplications to using only the simplest available metrics. In this work, we\nexplore the abilities of convolutional neural networks to predict a variety of\nvisual metrics without requiring either reference or rendered images.\nSpecifically, we train and deploy a neural network to estimate the visual error\nresulting from reusing shading or using reduced shading rates. The resulting\nmodels account for 70%-90% of the variance while achieving up to an order of\nmagnitude faster computation times. 
Our solution combines image-space\ninformation that is readily available in most state-of-the-art deferred shading\npipelines with reprojection from previous frames to enable an adequate estimate\nof visual errors, even in previously unseen regions. We describe a suitable\nconvolutional network architecture and considerations for data preparation for\ntraining. We demonstrate the capability of our network to predict complex error\nmetrics at interactive rates in a real-time application that implements\ncontent-adaptive shading in a deferred pipeline. Depending on the portion of\nunseen image regions, our approach can achieve up to $2\\times$ performance\ncompared to state-of-the-art methods.\n","authors":["João Libório Cardoso","Bernhard Kerbl","Lei Yang","Yury Uralsky","Michael Wimmer"],"pdf_url":"https://arxiv.org/pdf/2310.09125v1.pdf","comment":"Published at Proceedings of the ACM in Computer Graphics and\n Interactive Techniques. 14 Pages, 16 Figures, 3 Tables. For paper website and\n higher quality figures, see https://jaliborc.github.io/rt-percept/"},{"id":"http://arxiv.org/abs/2310.09123v1","updated":"2023-10-13T14:13:02Z","published":"2023-10-13T14:13:02Z","title":"Automatic Music Playlist Generation via Simulation-based Reinforcement\n Learning","summary":" Personalization of playlists is a common feature in music streaming services,\nbut conventional techniques, such as collaborative filtering, rely on explicit\nassumptions regarding content quality to learn how to make recommendations.\nSuch assumptions often result in misalignment between offline model objectives\nand online user satisfaction metrics. In this paper, we present a reinforcement\nlearning framework that solves for such limitations by directly optimizing for\nuser satisfaction metrics via the use of a simulated playlist-generation\nenvironment. Using this simulator we develop and train a modified Deep\nQ-Network, the action head DQN (AH-DQN), in a manner that addresses the\nchallenges imposed by the large state and action space of our RL formulation.\nThe resulting policy is capable of making recommendations from large and\ndynamic sets of candidate items with the expectation of maximizing consumption\nmetrics. We analyze and evaluate agents offline via simulations that use\nenvironment models trained on both public and proprietary streaming datasets.\nWe show how these agents lead to better user-satisfaction metrics compared to\nbaseline methods during online A/B tests. Finally, we demonstrate that\nperformance assessments produced from our simulator are strongly correlated\nwith observed online metric results.\n","authors":["Federico Tomasi","Joseph Cauteruccio","Surya Kanoria","Kamil Ciosek","Matteo Rinaldi","Zhenwen Dai"],"pdf_url":"https://arxiv.org/pdf/2310.09123v1.pdf","comment":"10 pages. KDD 23"},{"id":"http://arxiv.org/abs/2310.09118v1","updated":"2023-10-13T14:03:01Z","published":"2023-10-13T14:03:01Z","title":"DSG: An End-to-End Document Structure Generator","summary":" Information in industry, research, and the public sector is widely stored as\nrendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks,\nsystems are needed that map rendered documents onto a structured hierarchical\nformat. However, existing systems for this task are limited by heuristics and\nare not end-to-end trainable. In this work, we introduce the Document Structure\nGenerator (DSG), a novel system for document parsing that is fully end-to-end\ntrainable. 
DSG combines a deep neural network for parsing (i) entities in\ndocuments (e.g., figures, text blocks, headers, etc.) and (ii) relations that\ncapture the sequence and nested structure between entities. Unlike existing\nsystems that rely on heuristics, our DSG is trained end-to-end, making it\neffective and flexible for real-world applications. We further contribute a\nnew, large-scale dataset called E-Periodica comprising real-world magazines\nwith complex document structures for evaluation. Our results demonstrate that\nour DSG outperforms commercial OCR tools and, on top of that, achieves\nstate-of-the-art performance. To the best of our knowledge, our DSG system is\nthe first end-to-end trainable system for hierarchical document parsing.\n","authors":["Johannes Rausch","Gentiana Rashiti","Maxim Gusev","Ce Zhang","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2310.09118v1.pdf","comment":"Accepted at ICDM 2023"},{"id":"http://arxiv.org/abs/2306.09675v3","updated":"2023-10-13T13:48:58Z","published":"2023-06-16T08:13:41Z","title":"Multi-View Class Incremental Learning","summary":" Multi-view learning (MVL) has gained great success in integrating information\nfrom multiple perspectives of a dataset to improve downstream task performance.\nTo make MVL methods more practical in an open-ended environment, this paper\ninvestigates a novel paradigm called multi-view class incremental learning\n(MVCIL), where a single model incrementally classifies new classes from a\ncontinual stream of views, requiring no access to earlier views of data.\nHowever, MVCIL is challenged by the catastrophic forgetting of old information\nand the interference with learning new concepts. To address this, we first\ndevelop a randomization-based representation learning technique serving for\nfeature extraction to guarantee their separate view-optimal working states,\nduring which multiple views belonging to a class are presented sequentially;\nThen, we integrate them one by one in the orthogonality fusion subspace spanned\nby the extracted features; Finally, we introduce selective weight consolidation\nfor learning-without-forgetting decision-making while encountering new classes.\nExtensive experiments on synthetic and real-world datasets validate the\neffectiveness of our approach.\n","authors":["Depeng Li","Tianqi Wang","Junwei Chen","Kenji Kawaguchi","Cheng Lian","Zhigang Zeng"],"pdf_url":"https://arxiv.org/pdf/2306.09675v3.pdf","comment":"Accepted to Information Fusion"},{"id":"http://arxiv.org/abs/2309.03004v3","updated":"2023-10-13T13:34:45Z","published":"2023-09-06T13:48:40Z","title":"A Theoretical Explanation of Activation Sparsity through Flat Minima and\n Adversarial Robustness","summary":" A recent empirical observation (Li et al., 2022b) of activation sparsity in\nMLP blocks offers an opportunity to drastically reduce computation costs for\nfree. Although it has been attributed to training dynamics, existing theoretical\nexplanations of activation sparsity are restricted to shallow networks, small\ntraining steps and special training, despite its emergence in deep models\nstandardly trained for a large number of steps. To fill these gaps, we propose\nthe notion of gradient sparsity as one source of activation sparsity and a\ntheoretical explanation based on it that sees sparsity as a necessary step to\nadversarial robustness w.r.t. hidden features and parameters, which is\napproximately the flatness of minima for well-learned models. 
The theory\napplies to standardly trained LayerNorm-ed MLPs, and further to Transformers or\nother architectures trained with weight noises. Eliminating other sources of\nflatness except for sparsity, we discover the phenomenon that the ratio between\nthe largest and smallest non-zero singular values of weight matrices is small.\nWhen discussing the emergence of this spectral concentration, we use random\nmatrix theory (RMT) as a powerful tool to analyze stochastic gradient noises.\nValidation experiments are conducted to verify our gradient-sparsity-based\nexplanation. We propose two plug-and-play modules for both training and\nfinetuning for sparsity. Experiments on ImageNet-1k and C4 demonstrate their\n50% sparsity improvements, indicating further potential cost reduction in both\ntraining and inference.\n","authors":["Ze Peng","Lei Qi","Yinghuan Shi","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2309.03004v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09091v1","updated":"2023-10-13T13:22:05Z","published":"2023-10-13T13:22:05Z","title":"Insightful analysis of historical sources at scales beyond human\n capabilities using unsupervised Machine Learning and XAI","summary":" Historical materials are abundant. Yet, piecing together how human knowledge\nhas evolved and spread both diachronically and synchronically remains a\nchallenge that can so far only be very selectively addressed. The vast volume\nof materials precludes comprehensive studies, given the restricted number of\nhuman specialists. However, as large amounts of historical materials are now\navailable in digital form there is a promising opportunity for AI-assisted\nhistorical analysis. In this work, we take a pivotal step towards analyzing\nvast historical corpora by employing innovative machine learning (ML)\ntechniques, enabling in-depth historical insights on a grand scale. Our study\ncenters on the evolution of knowledge within the `Sacrobosco Collection' -- a\ndigitized collection of 359 early modern printed editions of textbooks on\nastronomy used at European universities between 1472 and 1650 -- roughly 76,000\npages, many of which contain astronomic, computational tables. An ML based\nanalysis of these tables helps to unveil important facets of the\nspatio-temporal evolution of knowledge and innovation in the field of\nmathematical astronomy in the period, as taught at European universities.\n","authors":["Oliver Eberle","Jochen Büttner","Hassan El-Hajj","Grégoire Montavon","Klaus-Robert Müller","Matteo Valleriani"],"pdf_url":"https://arxiv.org/pdf/2310.09091v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2107.06755v2","updated":"2023-10-13T13:19:31Z","published":"2021-07-14T15:09:27Z","title":"DIT4BEARs Smart Roads Internship","summary":" The research internship at UiT - The Arctic University of Norway was offered\nto our team for winning the 'Smart Roads - Winter Road Maintenance\n2021' Hackathon. The internship commenced on 3 May 2021 and ended on 21 May\n2021 with meetings happening twice each week. Despite having different\nnationalities and educational backgrounds, the two of us tried to collaborate\nas a team as much as possible. The most compelling part was that working on this\nproject made us realize the critical conditions faced by the Arctic people, an\nexperience that would have been hard to gain from our places of residence. We\ndeveloped and implemented several deep learning models to classify the states\n(dry, moist, wet, icy, snowy, slushy). 
Depending upon the best model, the\nweather forecast app will predict the state taking the Ta, Tsurf, Height,\nSpeed, Water, etc. into consideration. The crucial part was to define a safety\nmetric which is the product of the accident rates based on friction and the\naccident rates based on states. We developed a regressor that will predict the\nsafety metric depending upon the state obtained from the classifier and the\nfriction obtained from the sensor data. A pathfinding algorithm has been\ndesigned using the sensor data, open street map data, weather data.\n","authors":["Md Abrar Jahin","Andrii Krutsylo"],"pdf_url":"https://arxiv.org/pdf/2107.06755v2.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2306.10474v2","updated":"2023-10-13T13:05:26Z","published":"2023-06-18T04:34:17Z","title":"A Universal Semantic-Geometric Representation for Robotic Manipulation","summary":" Robots rely heavily on sensors, especially RGB and depth cameras, to perceive\nand interact with the world. RGB cameras record 2D images with rich semantic\ninformation while missing precise spatial information. On the other side, depth\ncameras offer critical 3D geometry data but capture limited semantics.\nTherefore, integrating both modalities is crucial for learning representations\nfor robotic perception and control. However, current research predominantly\nfocuses on only one of these modalities, neglecting the benefits of\nincorporating both. To this end, we present $\\textbf{Semantic-Geometric\nRepresentation} (\\textbf{SGR})$, a universal perception module for robotics\nthat leverages the rich semantic information of large-scale pre-trained 2D\nmodels and inherits the merits of 3D spatial reasoning. Our experiments\ndemonstrate that SGR empowers the agent to successfully complete a diverse\nrange of simulated and real-world robotic manipulation tasks, outperforming\nstate-of-the-art methods significantly in both single-task and multi-task\nsettings. Furthermore, SGR possesses the capability to generalize to novel\nsemantic attributes, setting it apart from the other methods. Project website:\nhttps://semantic-geometric-representation.github.io.\n","authors":["Tong Zhang","Yingdong Hu","Hanchen Cui","Hang Zhao","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2306.10474v2.pdf","comment":"CoRL 2023. Project website:\n https://semantic-geometric-representation.github.io"},{"id":"http://arxiv.org/abs/2303.15833v2","updated":"2023-10-13T12:49:35Z","published":"2023-03-28T09:05:15Z","title":"Complementary Domain Adaptation and Generalization for Unsupervised\n Continual Domain Shift Learning","summary":" Continual domain shift poses a significant challenge in real-world\napplications, particularly in situations where labeled data is not available\nfor new domains. The challenge of acquiring knowledge in this problem setting\nis referred to as unsupervised continual domain shift learning. Existing\nmethods for domain adaptation and generalization have limitations in addressing\nthis issue, as they focus either on adapting to a specific domain or\ngeneralizing to unseen domains, but not both. In this paper, we propose\nComplementary Domain Adaptation and Generalization (CoDAG), a simple yet\neffective learning framework that combines domain adaptation and generalization\nin a complementary manner to achieve three major goals of unsupervised\ncontinual domain shift learning: adapting to a current domain, generalizing to\nunseen domains, and preventing forgetting of previously seen domains. 
Our\napproach is model-agnostic, meaning that it is compatible with any existing\ndomain adaptation and generalization algorithms. We evaluate CoDAG on several\nbenchmark datasets and demonstrate that our model outperforms state-of-the-art\nmodels in all datasets and evaluation metrics, highlighting its effectiveness\nand robustness in handling unsupervised continual domain shift learning.\n","authors":["Wonguk Cho","Jinha Park","Taesup Kim"],"pdf_url":"https://arxiv.org/pdf/2303.15833v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2310.09071v1","updated":"2023-10-13T12:45:52Z","published":"2023-10-13T12:45:52Z","title":"Online Relocating and Matching of Ride-Hailing Services: A Model-Based\n Modular Approach","summary":" This study proposes an innovative model-based modular approach (MMA) to\ndynamically optimize order matching and vehicle relocation in a ride-hailing\nplatform. MMA utilizes a two-layer and modular modeling structure. The upper\nlayer determines the spatial transfer patterns of vehicle flow within the\nsystem to maximize the total revenue of the current and future stages. With the\nguidance provided by the upper layer, the lower layer performs rapid\nvehicle-to-order matching and vehicle relocation. MMA is interpretable, and\nequipped with the customized and polynomial-time algorithm, which, as an online\norder-matching and vehicle-relocation algorithm, can scale past thousands of\nvehicles. We theoretically prove that the proposed algorithm can achieve the\nglobal optimum in stylized networks, while the numerical experiments based on\nboth the toy network and realistic dataset demonstrate that MMA is capable of\nachieving superior systematic performance compared to batch matching and\nreinforcement-learning based methods. Moreover, its modular and lightweight\nmodeling structure further enables it to achieve a high level of robustness\nagainst demand variation while maintaining a relatively low computational cost.\n","authors":["Chang Gao","Xi Lin","Fang He","Xindi Tang"],"pdf_url":"https://arxiv.org/pdf/2310.09071v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09044v1","updated":"2023-10-13T12:12:34Z","published":"2023-10-13T12:12:34Z","title":"KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level\n Hallucination Detection","summary":" Large Language Models (LLMs) have demonstrated remarkable human-level natural\nlanguage generation capabilities. However, their potential to generate\nmisinformation, often called the hallucination problem, poses a significant\nrisk to their deployment. A common approach to address this issue is to\nretrieve relevant knowledge and fine-tune the LLM with the knowledge in its\ninput. Unfortunately, this method incurs high training costs and may cause\ncatastrophic forgetting for multi-tasking models. To overcome these\nlimitations, we propose a knowledge-constrained decoding method called KCTS\n(Knowledge-Constrained Tree Search), which guides a frozen LM to generate text\naligned with the reference knowledge at each decoding step using a knowledge\nclassifier score and MCTS (Monte-Carlo Tree Search). To adapt the\nsequence-level knowledge classifier to token-level guidance, we also propose a\nnovel token-level hallucination detection method called RIPA (Reward Inflection\nPoint Approximation). 
Our empirical results on knowledge-grounded dialogue and\nabstractive summarization demonstrate the strength of KCTS as a plug-and-play,\nmodel-agnostic decoding method that can effectively reduce hallucinations in\nnatural language generation.\n","authors":["Sehyun Choi","Tianqing Fang","Zhaowei Wang","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2310.09044v1.pdf","comment":"Accepted at EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2207.09768v3","updated":"2023-10-13T12:09:26Z","published":"2022-07-20T09:23:35Z","title":"Learning Counterfactually Invariant Predictors","summary":" Notions of counterfactual invariance (CI) have proven essential for\npredictors that are fair, robust, and generalizable in the real world. We\npropose graphical criteria that yield a sufficient condition for a predictor to\nbe counterfactually invariant in terms of a conditional independence in the\nobservational distribution. In order to learn such predictors, we propose a\nmodel-agnostic framework, called Counterfactually Invariant Prediction (CIP),\nbuilding on the Hilbert-Schmidt Conditional Independence Criterion (HSCIC), a\nkernel-based conditional dependence measure. Our experimental results\ndemonstrate the effectiveness of CIP in enforcing counterfactual invariance\nacross various simulated and real-world datasets including scalar and\nmulti-variate settings.\n","authors":["Francesco Quinzan","Cecilia Casolo","Krikamol Muandet","Yucen Luo","Niki Kilbertus"],"pdf_url":"https://arxiv.org/pdf/2207.09768v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09040v1","updated":"2023-10-13T12:07:36Z","published":"2023-10-13T12:07:36Z","title":"Optimal Scheduling of Electric Vehicle Charging with Deep Reinforcement\n Learning considering End Users Flexibility","summary":" The rapid growth of decentralized energy resources and especially Electric\nVehicles (EV), that are expected to increase sharply over the next decade, will\nput further stress on existing power distribution networks, increasing the need\nfor higher system reliability and flexibility. In an attempt to avoid\nunnecessary network investments and to increase the controllability over\ndistribution networks, network operators develop demand response (DR) programs\nthat incentivize end users to shift their consumption in return for financial\nor other benefits. Artificial intelligence (AI) methods are in the research\nforefront for residential load scheduling applications, mainly due to their\nhigh accuracy, high computational speed and lower dependence on the physical\ncharacteristics of the models under development. The aim of this work is to\nidentify households' EV cost-reducing charging policy under a Time-of-Use\ntariff scheme, with the use of Deep Reinforcement Learning, and more\nspecifically Deep Q-Networks (DQN). A novel end users flexibility potential\nreward is inferred from historical data analysis, where households with solar\npower generation have been used to train and test the designed algorithm. The\nsuggested DQN EV charging policy can lead to more than 20% of savings in end\nusers electricity bills.\n","authors":["Christoforos Menos-Aikateriniadis","Stavros Sykiotis","Pavlos S. 
Georgilakis"],"pdf_url":"https://arxiv.org/pdf/2310.09040v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09031v1","updated":"2023-10-13T11:47:41Z","published":"2023-10-13T11:47:41Z","title":"MINDE: Mutual Information Neural Diffusion Estimation","summary":" In this work we present a new method for the estimation of Mutual Information\n(MI) between random variables. Our approach is based on an original\ninterpretation of the Girsanov theorem, which allows us to use score-based\ndiffusion models to estimate the Kullback Leibler divergence between two\ndensities as a difference between their score functions. As a by-product, our\nmethod also enables the estimation of the entropy of random variables. Armed\nwith such building blocks, we present a general recipe to measure MI, which\nunfolds in two directions: one uses conditional diffusion process, whereas the\nother uses joint diffusion processes that allow simultaneous modelling of two\nrandom variables. Our results, which derive from a thorough experimental\nprotocol over all the variants of our approach, indicate that our method is\nmore accurate than the main alternatives from the literature, especially for\nchallenging distributions. Furthermore, our methods pass MI self-consistency\ntests, including data processing and additivity under independence, which\ninstead are a pain-point of existing methods.\n","authors":["Giulio Franzese","Mustapha Bounoua","Pietro Michiardi"],"pdf_url":"https://arxiv.org/pdf/2310.09031v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09028v1","updated":"2023-10-13T11:40:18Z","published":"2023-10-13T11:40:18Z","title":"Subspace Adaptation Prior for Few-Shot Learning","summary":" Gradient-based meta-learning techniques aim to distill useful prior knowledge\nfrom a set of training tasks such that new tasks can be learned more\nefficiently with gradient descent. While these methods have achieved successes\nin various scenarios, they commonly adapt all parameters of trainable layers\nwhen learning new tasks. This neglects potentially more efficient learning\nstrategies for a given task distribution and may be susceptible to overfitting,\nespecially in few-shot learning where tasks must be learned from a limited\nnumber of examples. To address these issues, we propose Subspace Adaptation\nPrior (SAP), a novel gradient-based meta-learning algorithm that jointly learns\ngood initialization parameters (prior knowledge) and layer-wise parameter\nsubspaces in the form of operation subsets that should be adaptable. In this\nway, SAP can learn which operation subsets to adjust with gradient descent\nbased on the underlying task distribution, simultaneously decreasing the risk\nof overfitting when learning new tasks. We demonstrate that this ability is\nhelpful as SAP yields superior or competitive performance in few-shot image\nclassification settings (gains between 0.1% and 3.9% in accuracy). Analysis of\nthe learned subspaces demonstrates that low-dimensional operations often yield\nhigh activation strengths, indicating that they may be important for achieving\ngood few-shot learning performance. For reproducibility purposes, we publish\nall our research code publicly.\n","authors":["Mike Huisman","Aske Plaat","Jan N. 
van Rijn"],"pdf_url":"https://arxiv.org/pdf/2310.09028v1.pdf","comment":"Accepted at Machine Learning Journal, Special Issue of the ECML PKDD\n 2023 Journal Track"},{"id":"http://arxiv.org/abs/2302.06887v3","updated":"2023-10-13T11:08:19Z","published":"2023-02-14T08:20:41Z","title":"Learning Graph ARMA Processes from Time-Vertex Spectra","summary":" The modeling of time-varying graph signals as stationary time-vertex\nstochastic processes permits the inference of missing signal values by\nefficiently employing the correlation patterns of the process across different\ngraph nodes and time instants. In this study, we propose an algorithm for\ncomputing graph autoregressive moving average (graph ARMA) processes based on\nlearning the joint time-vertex power spectral density of the process from its\nincomplete realizations for the task of signal interpolation. Our solution\nrelies on first roughly estimating the joint spectrum of the process from\npartially observed realizations and then refining this estimate by projecting\nit onto the spectrum manifold of the graph ARMA process through convex\nrelaxations. The initially missing signal values are then estimated based on\nthe learnt model. Experimental results show that the proposed approach achieves\nhigh accuracy in time-vertex signal estimation problems.\n","authors":["Eylem Tugce Guneyi","Berkay Yaldiz","Abdullah Canbolat","Elif Vural"],"pdf_url":"https://arxiv.org/pdf/2302.06887v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09002v1","updated":"2023-10-13T10:48:28Z","published":"2023-10-13T10:48:28Z","title":"Federated Meta-Learning for Few-Shot Fault Diagnosis with Representation\n Encoding","summary":" Deep learning-based fault diagnosis (FD) approaches require a large amount of\ntraining data, which are difficult to obtain since they are located across\ndifferent entities. Federated learning (FL) enables multiple clients to\ncollaboratively train a shared model with data privacy guaranteed. However, the\ndomain discrepancy and data scarcity problems among clients deteriorate the\nperformance of the global FL model. To tackle these issues, we propose a novel\nframework called representation encoding-based federated meta-learning (REFML)\nfor few-shot FD. First, a novel training strategy based on representation\nencoding and meta-learning is developed. It harnesses the inherent\nheterogeneity among training clients, effectively transforming it into an\nadvantage for out-of-distribution generalization on unseen working conditions\nor equipment types. Additionally, an adaptive interpolation method that\ncalculates the optimal combination of local and global models as the\ninitialization of local training is proposed. This helps to further utilize\nlocal information to mitigate the negative effects of domain discrepancy. As a\nresult, high diagnostic accuracy can be achieved on unseen working conditions\nor equipment types with limited training data. 
Compared with the\nstate-of-the-art methods, such as FedProx, the proposed REFML framework\nachieves an increase in accuracy by 2.17%-6.50% when tested on unseen working\nconditions of the same equipment type and 13.44%-18.33% when tested on totally\nunseen equipment types, respectively.\n","authors":["Jixuan Cui","Jun Li","Zhen Mei","Kang Wei","Sha Wei","Ming Ding","Wen Chen","Song Guo"],"pdf_url":"https://arxiv.org/pdf/2310.09002v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09000v1","updated":"2023-10-13T10:37:46Z","published":"2023-10-13T10:37:46Z","title":"Measuring the Stability of Process Outcome Predictions in Online\n Settings","summary":" Predictive Process Monitoring aims to forecast the future progress of process\ninstances using historical event data. As predictive process monitoring is\nincreasingly applied in online settings to enable timely interventions,\nevaluating the performance of the underlying models becomes crucial for\nensuring their consistency and reliability over time. This is especially\nimportant in high risk business scenarios where incorrect predictions may have\nsevere consequences. However, predictive models are currently usually evaluated\nusing a single, aggregated value or a time-series visualization, which makes it\nchallenging to assess their performance and, specifically, their stability over\ntime. This paper proposes an evaluation framework for assessing the stability\nof models for online predictive process monitoring. The framework introduces\nfour performance meta-measures: the frequency of significant performance drops,\nthe magnitude of such drops, the recovery rate, and the volatility of\nperformance. To validate this framework, we applied it to two artificial and\ntwo real-world event logs. The results demonstrate that these meta-measures\nfacilitate the comparison and selection of predictive models for different\nrisk-taking scenarios. Such insights are of particular value to enhance\ndecision-making in dynamic business environments.\n","authors":["Suhwan Lee","Marco Comuzzi","Xixi Lu","Hajo A. Reijers"],"pdf_url":"https://arxiv.org/pdf/2310.09000v1.pdf","comment":"8 pages, 3 figures, Proceedings of the 5th International Conference\n on Process Mining (ICPM 2023)"},{"id":"http://arxiv.org/abs/2310.08988v1","updated":"2023-10-13T10:09:12Z","published":"2023-10-13T10:09:12Z","title":"Reroute Prediction Service","summary":" The cost of delays was estimated as 33 billion US dollars only in 2019 for\nthe US National Airspace System, a peak value following a growth trend in past\nyears. Aiming to address this huge inefficiency, we designed and developed a\nnovel Data Analytics and Machine Learning system, which aims at reducing delays\nby proactively supporting re-routing decisions.\n Given a time interval up to a few days in the future, the system predicts if\na reroute advisory for a certain Air Route Traffic Control Center or for a\ncertain advisory identifier will be issued, which may impact the pertinent\nroutes. To deliver such predictions, the system uses historical reroute data,\ncollected from the System Wide Information Management (SWIM) data services\nprovided by the FAA, and weather data, provided by the US National Centers for\nEnvironmental Prediction (NCEP). The data is huge in volume, and has many items\nstreamed at high velocity, uncorrelated and noisy. 
The system continuously\nprocesses the incoming raw data and makes it available for the next step where\nan interim data store is created and adaptively maintained for efficient query\nprocessing. The resulting data is fed into an array of ML algorithms, which\ncompete for higher accuracy. The best performing algorithm is used in the final\nprediction, generating the final results. Mean accuracy values higher than 90%\nwere obtained in our experiments with this system.\n Our algorithm divides the area of interest in units of aggregation and uses\ntemporal series of the aggregate measures of weather forecast parameters in\neach geographical unit, in order to detect correlations with reroutes and where\nthey will most likely occur. Aiming at practical application, the system is\nformed by a number of microservices, which are deployed in the cloud, making\nthe system distributed, scalable and highly available.\n","authors":["Ítalo Romani de Oliveira","Samet Ayhan","Michael Biglin","Pablo Costas","Euclides C. Pinto Neto"],"pdf_url":"https://arxiv.org/pdf/2310.08988v1.pdf","comment":"Submitted to the 2023 IEEE/AIAA Digital Aviation Systems Conference\n (DASC)"},{"id":"http://arxiv.org/abs/2308.09259v2","updated":"2023-10-13T09:41:45Z","published":"2023-08-18T02:34:37Z","title":"FRGNN: Mitigating the Impact of Distribution Shift on Graph Neural\n Networks via Test-Time Feature Reconstruction","summary":" Due to inappropriate sample selection and limited training data, a\ndistribution shift often exists between the training and test sets. This shift\ncan adversely affect the test performance of Graph Neural Networks (GNNs).\nExisting approaches mitigate this issue by either enhancing the robustness of\nGNNs to distribution shift or reducing the shift itself. However, both\napproaches necessitate retraining the model, which becomes unfeasible when the\nmodel structure and parameters are inaccessible. To address this challenge, we\npropose FR-GNN, a general framework for GNNs to conduct feature reconstruction.\nFRGNN constructs a mapping relationship between the output and input of a\nwell-trained GNN to obtain class representative embeddings and then uses these\nembeddings to reconstruct the features of labeled nodes. These reconstructed\nfeatures are then incorporated into the message passing mechanism of GNNs to\ninfluence the predictions of unlabeled nodes at test time. Notably, the\nreconstructed node features can be directly utilized for testing the\nwell-trained model, effectively reducing the distribution shift and leading to\nimproved test performance. This remarkable achievement is attained without any\nmodifications to the model structure or parameters. We provide theoretical\nguarantees for the effectiveness of our framework. Furthermore, we conduct\ncomprehensive experiments on various public datasets. 
The experimental results\ndemonstrate the superior performance of FRGNN in comparison to multiple\ncategories of baseline methods.\n","authors":["Rui Ding","Jielong Yang","Feng Ji","Xionghu Zhong","Linbo Xie"],"pdf_url":"https://arxiv.org/pdf/2308.09259v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.16978v2","updated":"2023-10-13T09:37:06Z","published":"2023-06-29T14:32:06Z","title":"Learning Coverage Paths in Unknown Environments with Reinforcement\n Learning","summary":" Coverage path planning (CPP) is the problem of finding a path that covers the\nentire free space of a confined area, with applications ranging from robotic\nlawn mowing and vacuum cleaning, to demining and search-and-rescue tasks. While\noffline methods can find provably complete, and in some cases optimal, paths\nfor known environments, their value is limited in online scenarios where the\nenvironment is not known beforehand. In this case, the path needs to be planned\nonline while mapping the environment. We investigate how suitable reinforcement\nlearning is for this challenging problem, and analyze the involved components\nrequired to efficiently learn coverage paths, such as action space, input\nfeature representation, neural network architecture, and reward function.\nCompared to existing classical methods, this approach allows for a flexible\npath space, and enables the agent to adapt to specific environment dynamics. In\naddition to local sensory inputs for acting on short-term obstacle detections,\nwe propose to use egocentric maps in multiple scales based on frontiers. This\nallows the agent to plan a long-term path in large-scale environments with\nfeasible computational and memory complexity. Furthermore, we propose a novel\ntotal variation reward term for guiding the agent not to leave small holes of\nnon-covered free space. To validate the effectiveness of our approach, we\nperform extensive experiments in simulation with a 2D ranging sensor on\ndifferent variations of the CPP problem, surpassing the performance of both\nprevious RL-based approaches and highly specialized methods.\n","authors":["Arvi Jonnarth","Jie Zhao","Michael Felsberg"],"pdf_url":"https://arxiv.org/pdf/2306.16978v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.08627v2","updated":"2023-10-13T09:12:03Z","published":"2023-03-15T13:54:11Z","title":"From Images to Features: Unbiased Morphology Classification via\n Variational Auto-Encoders and Domain Adaptation","summary":" We present a novel approach for the dimensionality reduction of galaxy images\nby leveraging a combination of variational auto-encoders (VAE) and domain\nadaptation (DA). We demonstrate the effectiveness of this approach using a\nsample of low redshift galaxies with detailed morphological type labels from\nthe Galaxy-Zoo DECaLS project. We show that 40-dimensional latent variables can\neffectively reproduce most morphological features in galaxy images. To further\nvalidate the effectiveness of our approach, we utilised a classical random\nforest (RF) classifier on the 40-dimensional latent variables to make detailed\nmorphology feature classifications. This approach performs similarly to a\ndirect neural network application on galaxy images. We further enhance our\nmodel by tuning the VAE network via DA using galaxies in the overlapping\nfootprint of DECaLS and BASS+MzLS, enabling the unbiased application of our\nmodel to galaxy images in both surveys. We observed that DA led to even better\nmorphological feature extraction and classification performance. 
Overall, this\ncombination of VAE and DA can be applied to achieve image dimensionality\nreduction, defect image identification, and morphology classification in large\noptical surveys.\n","authors":["Quanfeng Xu","Shiyin Shen","Rafael S. de Souza","Mi Chen","Renhao Ye","Yumei She","Zhu Chen","Emille E. O. Ishida","Alberto Krone-Martins","Rupesh Durgesh"],"pdf_url":"https://arxiv.org/pdf/2303.08627v2.pdf","comment":"Accepted by MNRAS 2023 October 12. 10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.08961v1","updated":"2023-10-13T09:11:35Z","published":"2023-10-13T09:11:35Z","title":"PAGE: Equilibrate Personalization and Generalization in Federated\n Learning","summary":" Federated learning (FL) is becoming a major driving force behind machine\nlearning as a service, where customers (clients) collaboratively benefit from\nshared local updates under the orchestration of the service provider (server).\nRepresenting clients' current demands and the server's future demand, local\nmodel personalization and global model generalization are separately\ninvestigated, as the ill-effects of data heterogeneity enforce the community to\nfocus on one over the other. However, these two seemingly competing goals are\nof equal importance rather than black and white issues, and should be achieved\nsimultaneously. In this paper, we propose the first algorithm to balance\npersonalization and generalization on top of game theory, dubbed PAGE, which\nreshapes FL as a co-opetition game between clients and the server. To explore\nthe equilibrium, PAGE further formulates the game as Markov decision processes,\nand leverages the reinforcement learning algorithm, which simplifies the\nsolving complexity. Extensive experiments on four widespread datasets show that\nPAGE outperforms state-of-the-art FL baselines in terms of global and local\nprediction accuracy simultaneously, and the accuracy can be improved by up to\n35.20% and 39.91%, respectively. In addition, biased variants of PAGE imply\npromising adaptiveness to demand shifts in practice.\n","authors":["Qian Chen","Zilong Wang","Jiaqi Hu","Haonan Yan","Jianying Zhou","Xiaodong Lin"],"pdf_url":"https://arxiv.org/pdf/2310.08961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.09323v2","updated":"2023-10-13T08:58:46Z","published":"2023-09-17T17:01:05Z","title":"Answering Layer 3 queries with DiscoSCMs","summary":" Addressing causal queries across the Pearl Causal Hierarchy (PCH) (i.e.,\nassociational, interventional and counterfactual), which is formalized as\n\\Layer{} Valuations, is a central task in contemporary causal inference\nresearch. Counterfactual questions, in particular, pose a significant challenge\nas they often necessitate a complete knowledge of structural equations. This\npaper identifies \\textbf{the degeneracy problem} caused by the consistency\nrule. To tackle this, the \\textit{Distribution-consistency Structural Causal\nModels} (DiscoSCMs) is introduced, which extends both the structural causal\nmodels (SCM) and the potential outcome framework. The correlation pattern of\npotential outcomes in personalized incentive scenarios, described by $P(y_x,\ny'_{x'})$, is used as a case study for elucidation. Although counterfactuals\nare no longer degenerate, they remain indeterminable. As a result, the\ncondition of independent potential noise is incorporated into DiscoSCM. It is\nfound that by adeptly using homogeneity, counterfactuals can be identified.\nFurthermore, more refined results are achieved in the unit problem scenario. 
In\nsimpler terms, when modeling counterfactuals, one should contemplate: \"Consider\na person with average ability who takes a test and, due to good luck, achieves\nan exceptionally high score. If this person were to retake the test under\nidentical external conditions, what score would he obtain? An exceptionally high\nscore or an average score?\" If your choice is to predict an average score, then\nyou are essentially choosing DiscoSCM over the traditional frameworks based on\nthe consistency rule.\n","authors":["Heyang Gong"],"pdf_url":"https://arxiv.org/pdf/2309.09323v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.00527v2","updated":"2023-10-13T08:58:34Z","published":"2023-07-02T09:38:43Z","title":"Graph Neural Networks based Log Anomaly Detection and Explanation","summary":" Event logs are widely used to record the status of high-tech systems, making\nlog anomaly detection important for monitoring those systems. Most existing log\nanomaly detection methods take a log event count matrix or log event sequences\nas input, exploiting quantitative and/or sequential relationships between log\nevents to detect anomalies. Unfortunately, only considering quantitative or\nsequential relationships may result in low detection accuracy. To alleviate\nthis problem, we propose a graph-based method for unsupervised log anomaly\ndetection, dubbed Logs2Graphs, which first converts event logs into attributed,\ndirected, and weighted graphs, and then leverages graph neural networks to\nperform graph-level anomaly detection. Specifically, we introduce One-Class\nDigraph Inception Convolutional Networks, abbreviated as OCDiGCN, a novel graph\nneural network model for detecting graph-level anomalies in a collection of\nattributed, directed, and weighted graphs. By coupling the graph representation\nand anomaly detection steps, OCDiGCN can learn a representation that is\nespecially suited for anomaly detection, resulting in a high detection\naccuracy. Importantly, for each identified anomaly, we additionally provide a\nsmall subset of nodes that play a crucial role in OCDiGCN's prediction as\nexplanations, which can offer valuable cues for subsequent root cause\ndiagnosis. Experiments on five benchmark datasets show that Logs2Graphs\nperforms at least on par with state-of-the-art log anomaly detection methods on\nsimple datasets while largely outperforming state-of-the-art log anomaly\ndetection methods on complicated datasets.\n","authors":["Zhong Li","Jiayang Shi","Matthijs van Leeuwen"],"pdf_url":"https://arxiv.org/pdf/2307.00527v2.pdf","comment":"Preprint submitted to Engineering Applications of Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2310.04993v2","updated":"2023-10-13T08:37:29Z","published":"2023-10-08T03:41:16Z","title":"Prompt-augmented Temporal Point Process for Streaming Event Sequence","summary":" Neural Temporal Point Processes (TPPs) are the prevalent paradigm for\nmodeling continuous-time event sequences, such as user activities on the web\nand financial transactions. In real-world applications, event data is typically\nreceived in a \\emph{streaming} manner, where the distribution of patterns may\nshift over time. Additionally, \\emph{privacy and memory constraints} are\ncommonly observed in practical scenarios, further compounding the challenges.\nTherefore, the continuous monitoring of a TPP to learn the streaming event\nsequence is an important yet under-explored problem. 
Our work addresses\nthis challenge by adopting Continual Learning (CL), which makes the model\ncapable of continuously learning a sequence of tasks without catastrophic\nforgetting under realistic constraints. Correspondingly, we propose a simple\nyet effective framework, PromptTPP\\footnote{Our code is available at {\\small\n\\url{ https://github.com/yanyanSann/PromptTPP}}}, by integrating the base TPP\nwith a continuous-time retrieval prompt pool. The prompts, small learnable\nparameters, are stored in a memory space and jointly optimized with the base\nTPP, ensuring that the model learns event streams sequentially without\nbuffering past examples or task-specific attributes. We present a novel and\nrealistic experimental setup for modeling event streams, where PromptTPP\nconsistently achieves state-of-the-art performance across three real user\nbehavior datasets.\n","authors":["Siqiao Xue","Yan Wang","Zhixuan Chu","Xiaoming Shi","Caigao Jiang","Hongyan Hao","Gangwei Jiang","Xiaoyun Feng","James Y. Zhang","Jun Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.04993v2.pdf","comment":"NeurIPS 2023 camera ready version"},{"id":"http://arxiv.org/abs/2310.08944v1","updated":"2023-10-13T08:19:31Z","published":"2023-10-13T08:19:31Z","title":"CAMELL: Confidence-based Acquisition Model for Efficient Self-supervised\n Active Learning with Label Validation","summary":" Supervised neural approaches are hindered by their dependence on large,\nmeticulously annotated datasets, a requirement that is particularly cumbersome\nfor sequential tasks. The quality of annotations tends to deteriorate with the\ntransition from expert-based to crowd-sourced labelling. To address these\nchallenges, we present \\textbf{CAMELL} (Confidence-based Acquisition Model for\nEfficient self-supervised active Learning with Label validation), a pool-based\nactive learning framework tailored for sequential multi-output problems. CAMELL\npossesses three core features: (1) it requires expert annotators to label only\na fraction of a chosen sequence, (2) it facilitates self-supervision for the\nremainder of the sequence, and (3) it employs a label validation mechanism to\nprevent erroneous labels from contaminating the dataset and harming model\nperformance. We evaluate CAMELL on sequential tasks, with a special emphasis on\ndialogue belief tracking, a task plagued by the constraints of limited and\nnoisy datasets. Our experiments demonstrate that CAMELL outperforms the\nbaselines in terms of efficiency. Furthermore, the data corrections suggested\nby our method contribute to an overall improvement in the quality of the\nresulting datasets.\n","authors":["Carel van Niekerk","Christian Geishauser","Michael Heck","Shutong Feng","Hsien-chin Lin","Nurul Lubis","Benjamin Ruppik","Renato Vukovic","Milica Gašić"],"pdf_url":"https://arxiv.org/pdf/2310.08944v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.05880v4","updated":"2023-10-13T08:00:04Z","published":"2023-06-09T13:20:04Z","title":"Time Series Continuous Modeling for Imputation and Forecasting with\n Implicit Neural Representations","summary":" We introduce a novel modeling approach for time series imputation and\nforecasting, tailored to address the challenges often encountered in real-world\ndata, such as irregular samples, missing data, or unaligned measurements from\nmultiple sensors. Our method relies on a continuous-time-dependent model of the\nseries' evolution dynamics. It leverages adaptations of conditional, implicit\nneural representations for sequential data. 
A modulation mechanism, driven by a\nmeta-learning algorithm, allows adaptation to unseen samples and extrapolation\nbeyond observed time-windows for long-term predictions. The model provides a\nhighly flexible and unified framework for imputation and forecasting tasks\nacross a wide range of challenging scenarios. It achieves state-of-the-art\nperformance on classical benchmarks and outperforms alternative time-continuous\nmodels.\n","authors":["Etienne Le Naour","Louis Serrano","Léon Migus","Yuan Yin","Ghislain Agoua","Nicolas Baskiotis","Patrick Gallinari","Vincent Guigue"],"pdf_url":"https://arxiv.org/pdf/2306.05880v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14642v2","updated":"2023-10-13T07:55:20Z","published":"2023-05-24T02:23:00Z","title":"Newton-Cotes Graph Neural Networks: On the Time Evolution of Dynamic\n Systems","summary":" Reasoning system dynamics is one of the most important analytical approaches\nfor many scientific studies. With the initial state of a system as input, the\nrecent graph neural networks (GNNs)-based methods are capable of predicting the\nfuture state distant in time with high accuracy. Although these methods have\ndiverse designs in modeling the coordinates and interacting forces of the\nsystem, we show that they actually share a common paradigm that learns the\nintegration of the velocity over the interval between the initial and terminal\ncoordinates. However, their integrand is constant w.r.t. time. Inspired by this\nobservation, we propose a new approach to predict the integration based on\nseveral velocity estimations with Newton-Cotes formulas and prove its\neffectiveness theoretically. Extensive experiments on several benchmarks\nempirically demonstrate consistent and significant improvement compared with\nthe state-of-the-art methods.\n","authors":["Lingbing Guo","Weiqing Wang","Zhuo Chen","Ningyu Zhang","Zequn Sun","Yixuan Lai","Qiang Zhang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2305.14642v2.pdf","comment":"NeurIPS 2023 (spotlight)"},{"id":"http://arxiv.org/abs/2310.08922v1","updated":"2023-10-13T07:47:44Z","published":"2023-10-13T07:47:44Z","title":"LLaMA Rider: Spurring Large Language Models to Explore the Open World","summary":" Recently, various studies have leveraged Large Language Models (LLMs) to help\ndecision-making and planning in environments, and try to align the LLMs'\nknowledge with the world conditions. Nonetheless, the capacity of LLMs to\ncontinuously acquire environmental knowledge and adapt in an open world remains\nuncertain. In this paper, we propose an approach to spur LLMs to explore the\nopen world, gather experiences, and learn to improve their task-solving\ncapabilities. In this approach, a multi-round feedback-revision mechanism is\nutilized to encourage LLMs to actively select appropriate revision actions\nguided by feedback information from the environment. This facilitates\nexploration and enhances the model's performance. Besides, we integrate\nsub-task relabeling to assist LLMs in maintaining consistency in sub-task\nplanning and help the model learn the combinatorial nature between tasks,\nenabling it to complete a wider range of tasks through training based on the\nacquired exploration experiences. 
By evaluation in Minecraft, an open-ended\nsandbox world, we demonstrate that our approach LLaMA-Rider enhances the\nefficiency of the LLM in exploring the environment, and effectively improves\nthe LLM's ability to accomplish more tasks through fine-tuning with merely 1.3k\ninstances of collected data, showing minimal training costs compared to the\nbaseline using reinforcement learning.\n","authors":["Yicheng Feng","Yuxuan Wang","Jiazheng Liu","Sipeng Zheng","Zongqing Lu"],"pdf_url":"https://arxiv.org/pdf/2310.08922v1.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2308.02416v2","updated":"2023-10-13T07:44:39Z","published":"2023-08-03T02:07:32Z","title":"Local-Global Temporal Fusion Network with an Attention Mechanism for\n Multiple and Multiclass Arrhythmia Classification","summary":" Clinical decision support systems (CDSSs) have been widely utilized to\nsupport the decisions made by cardiologists when detecting and classifying\narrhythmia from electrocardiograms (ECGs). However, forming a CDSS for the\narrhythmia classification task is challenging due to the varying lengths of\narrhythmias. Although the onset time of arrhythmia varies, previously developed\nmethods have not considered such conditions. Thus, we propose a framework that\nconsists of (i) local temporal information extraction, (ii) global pattern\nextraction, and (iii) local-global information fusion with attention to perform\narrhythmia detection and classification with a constrained input length. The\n10-class and 4-class performances of our approach were assessed by detecting\nthe onset and offset of arrhythmia as an episode and the duration of arrhythmia\nbased on the MIT-BIH arrhythmia database (MITDB) and MIT-BIH atrial\nfibrillation database (AFDB), respectively. The results were statistically\nsuperior to those achieved by the comparison models. To check the\ngeneralization ability of the proposed method, an AFDB-trained model was tested\non the MITDB, and superior performance was attained compared with that of a\nstate-of-the-art model. The proposed method can capture local-global\ninformation and dynamics without incurring information losses. Therefore,\narrhythmias can be recognized more accurately, and their occurrence times can\nbe calculated; thus, the clinical field can create more accurate treatment\nplans by using the proposed method.\n","authors":["Yun Kwan Kim","Minji Lee","Kunwook Jo","Hee Seok Song","Seong-Whan Lee"],"pdf_url":"https://arxiv.org/pdf/2308.02416v2.pdf","comment":"14 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.08920v1","updated":"2023-10-13T07:44:05Z","published":"2023-10-13T07:44:05Z","title":"Embarrassingly Simple Text Watermarks","summary":" We propose Easymark, a family of embarrassingly simple yet effective\nwatermarks. Text watermarking is becoming increasingly important with the\nadvent of Large Language Models (LLM). LLMs can generate texts that cannot be\ndistinguished from human-written texts. This is a serious problem for the\ncredibility of the text. Easymark is a simple yet effective solution to this\nproblem. Easymark can inject a watermark without changing the meaning of the\ntext at all while a validator can detect if a text was generated from a system\nthat adopted Easymark or not with high credibility. Easymark is extremely easy\nto implement so that it only requires a few lines of code. Easymark does not\nrequire access to LLMs, so it can be implemented on the user-side when the LLM\nproviders do not offer watermarked LLMs. 
In spite of its simplicity, it\nachieves higher detection accuracy and BLEU scores than the state-of-the-art\ntext watermarking methods. We also prove the impossibility theorem of perfect\nwatermarking, which is valuable in its own right. This theorem shows that no\nmatter how sophisticated a watermark is, a malicious user could remove it from\nthe text, which motivates us to use a simple watermark such as Easymark. We\ncarry out experiments with LLM-generated texts and confirm that Easymark can be\ndetected reliably without any degradation of BLEU and perplexity, and\noutperforms state-of-the-art watermarks in terms of both quality and\nreliability.\n","authors":["Ryoma Sato","Yuki Takezawa","Han Bao","Kenta Niwa","Makoto Yamada"],"pdf_url":"https://arxiv.org/pdf/2310.08920v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08917v1","updated":"2023-10-13T07:40:12Z","published":"2023-10-13T07:40:12Z","title":"Relation-aware Ensemble Learning for Knowledge Graph Embedding","summary":" Knowledge graph (KG) embedding is a fundamental task in natural language\nprocessing, and various methods have been proposed to explore semantic patterns\nin distinctive ways. In this paper, we propose to learn an ensemble by\nleveraging existing methods in a relation-aware manner. However, exploring\nthese semantics using relation-aware ensemble leads to a much larger search\nspace than general ensemble methods. To address this issue, we propose a\ndivide-search-combine algorithm RelEns-DSC that searches the relation-wise\nensemble weights independently. This algorithm has the same computation cost as\ngeneral ensemble methods but with much better performance. Experimental results\non benchmark datasets demonstrate the effectiveness of the proposed method in\nefficiently searching relation-aware ensemble weights and achieving\nstate-of-the-art embedding performance. The code is public at\nhttps://github.com/LARS-research/RelEns.\n","authors":["Ling Yue","Yongqi Zhang","Quanming Yao","Yong Li","Xian Wu","Ziheng Zhang","Zhenxi Lin","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.08917v1.pdf","comment":"This short paper has been accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08910v1","updated":"2023-10-13T07:31:04Z","published":"2023-10-13T07:31:04Z","title":"Scalarization for Multi-Task and Multi-Domain Learning at Scale","summary":" Training a single model on multiple input domains and/or output tasks allows\nfor compressing information from multiple sources into a unified backbone hence\nimproves model efficiency. It also enables potential positive knowledge\ntransfer across tasks/domains, leading to improved accuracy and data-efficient\ntraining. However, optimizing such networks is a challenge, in particular due\nto discrepancies between the different tasks or domains: Despite several\nhypotheses and solutions proposed over the years, recent work has shown that\nuniform scalarization training, i.e., simply minimizing the average of the task\nlosses, yields on-par performance with more costly SotA optimization methods.\nThis raises the issue of how well we understand the training dynamics of\nmulti-task and multi-domain networks. In this work, we first devise a\nlarge-scale unified analysis of multi-domain and multi-task learning to better\nunderstand the dynamics of scalarization across varied task/domain combinations\nand model sizes. 
Following these insights, we then propose to leverage\npopulation-based training to efficiently search for the optimal scalarization\nweights when dealing with a large number of tasks or domains.\n","authors":["Amelie Royer","Tijmen Blankevoort","Babak Ehteshami Bejnordi"],"pdf_url":"https://arxiv.org/pdf/2310.08910v1.pdf","comment":"NeurIPS 2023; https://openreview.net/forum?id=TSuq3debnD"},{"id":"http://arxiv.org/abs/2310.08909v1","updated":"2023-10-13T07:30:50Z","published":"2023-10-13T07:30:50Z","title":"Community Membership Hiding as Counterfactual Graph Search via Deep\n Reinforcement Learning","summary":" Community detection techniques are useful tools for social media platforms to\ndiscover tightly connected groups of users who share common interests. However,\nthis functionality often comes at the expense of potentially exposing\nindividuals to privacy breaches by inadvertently revealing their tastes or\npreferences. Therefore, some users may wish to safeguard their anonymity and\nopt out of community detection for various reasons, such as affiliation with\npolitical or religious organizations.\n In this study, we address the challenge of community membership hiding, which\ninvolves strategically altering the structural properties of a network graph to\nprevent one or more nodes from being identified by a given community detection\nalgorithm. We tackle this problem by formulating it as a constrained\ncounterfactual graph objective, and we solve it via deep reinforcement\nlearning. We validate the effectiveness of our method through two distinct\ntasks: node and community deception. Extensive experiments show that our\napproach overall outperforms existing baselines in both tasks.\n","authors":["Andrea Bernini","Fabrizio Silvestri","Gabriele Tolomei"],"pdf_url":"https://arxiv.org/pdf/2310.08909v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2202.03574v4","updated":"2023-10-13T07:13:51Z","published":"2022-02-04T12:30:49Z","title":"Structured Prediction Problem Archive","summary":" Structured prediction problems are one of the fundamental tools in machine\nlearning. In order to facilitate algorithm development for their numerical\nsolution, we collect in one place a large number of datasets in easy to read\nformats for a diverse set of problem classes. We provide archival links to\ndatasets, description of the considered problems and problem formats, and a\nshort summary of problem characteristics including size, number of instances\netc. For reference we also give a non-exhaustive selection of algorithms\nproposed in the literature for their solution. We hope that this central\nrepository will make benchmarking and comparison to established works easier.\nWe welcome submission of interesting new datasets and algorithms for inclusion\nin our archive.\n","authors":["Paul Swoboda","Ahmed Abbas","Florian Bernard","Andrea Hornakova","Paul Roetzer","Bogdan Savchynskyy"],"pdf_url":"https://arxiv.org/pdf/2202.03574v4.pdf","comment":"Added new shape matching instances based of learned descriptors"},{"id":"http://arxiv.org/abs/2310.08897v1","updated":"2023-10-13T06:58:52Z","published":"2023-10-13T06:58:52Z","title":"Self supervised convolutional kernel based handcrafted feature\n harmonization: Enhanced left ventricle hypertension disease phenotyping on\n echocardiography","summary":" Radiomics, a medical imaging technique, extracts quantitative handcrafted\nfeatures from images to predict diseases. 
Harmonization in those features\nensures consistent feature extraction across various imaging devices and\nprotocols. Methods for harmonization include standardized imaging protocols,\nstatistical adjustments, and evaluating feature robustness. Myocardial diseases\nsuch as Left Ventricular Hypertrophy (LVH) and Hypertensive Heart Disease (HHD)\nare diagnosed via echocardiography, but variable imaging settings pose\nchallenges. Harmonization techniques are crucial for applying handcrafted\nfeatures in disease diagnosis in such scenarios. Self-supervised learning (SSL)\nenhances data understanding within limited datasets and adapts to diverse data\nsettings. ConvNeXt-V2 integrates convolutional layers into SSL, displaying\nsuperior performance in various tasks. This study focuses on convolutional\nfilters within SSL, using them as preprocessing to convert images into feature\nmaps for handcrafted feature harmonization. Our proposed method excelled in\nharmonization evaluation and exhibited superior LVH classification performance\ncompared to existing methods.\n","authors":["Jina Lee","Youngtaek Hong","Dawun Jeong","Yeonggul Jang","Sihyeon Jeong","Taekgeun Jung","Yeonyee E. Yoon","Inki Moon","Seung-Ah Lee","Hyuk-Jae Chang"],"pdf_url":"https://arxiv.org/pdf/2310.08897v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08891v1","updated":"2023-10-13T06:53:02Z","published":"2023-10-13T06:53:02Z","title":"EHI: End-to-end Learning of Hierarchical Index for Efficient Dense\n Retrieval","summary":" Dense embedding-based retrieval is now the industry standard for semantic\nsearch and ranking problems, like obtaining relevant web documents for a given\nquery. Such techniques use a two-stage process: (a) contrastive learning to\ntrain a dual encoder to embed both the query and documents and (b) approximate\nnearest neighbor search (ANNS) for finding similar documents for a given query.\nThese two stages are disjoint; the learned embeddings might be ill-suited for\nthe ANNS method and vice-versa, leading to suboptimal performance. In this\nwork, we propose End-to-end Hierarchical Indexing -- EHI -- that jointly learns\nboth the embeddings and the ANNS structure to optimize retrieval performance.\nEHI uses a standard dual encoder model for embedding queries and documents\nwhile learning an inverted file index (IVF) style tree structure for efficient\nANNS. To ensure stable and efficient learning of discrete tree-based ANNS\nstructure, EHI introduces the notion of dense path embedding that captures the\nposition of a query/document in the tree. We demonstrate the effectiveness of\nEHI on several benchmarks, including de-facto industry standard MS MARCO (Dev\nset and TREC DL19) datasets. For example, with the same compute budget, EHI\noutperforms state-of-the-art (SOTA) by 0.6% (MRR@10) on MS MARCO dev set and\nby 4.2% (nDCG@10) on TREC DL19 benchmarks.\n","authors":["Ramnath Kumar","Anshul Mittal","Nilesh Gupta","Aditya Kusupati","Inderjit Dhillon","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2310.08891v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.02182v3","updated":"2023-10-13T06:51:27Z","published":"2023-08-04T07:54:45Z","title":"AutoML4ETC: Automated Neural Architecture Search for Real-World\n Encrypted Traffic Classification","summary":" Deep learning (DL) has been successfully applied to encrypted network traffic\nclassification in experimental settings. 
However, in production use, it has\nbeen shown that a DL classifier's performance inevitably decays over time.\nRe-training the model on newer datasets has been shown to only partially\nimprove its performance. Manually re-tuning the model architecture to meet the\nperformance expectations on newer datasets is time-consuming and requires\ndomain expertise. We propose AutoML4ETC, a novel tool to automatically design\nefficient and high-performing neural architectures for encrypted traffic\nclassification. We define a novel, powerful search space tailored specifically\nfor the early classification of encrypted traffic using packet header bytes. We\nshow that with different search strategies over our search space, AutoML4ETC\ngenerates neural architectures that outperform the state-of-the-art encrypted\ntraffic classifiers on several datasets, including public benchmark datasets\nand real-world TLS and QUIC traffic collected from the Orange mobile network.\nIn addition to being more accurate, AutoML4ETC's architectures are\nsignificantly more efficient and lighter in terms of the number of parameters.\nFinally, we make AutoML4ETC publicly available for future research.\n","authors":["Navid Malekghaini","Elham Akbari","Mohammad A. Salahuddin","Noura Limam","Raouf Boutaba","Bertrand Mathieu","Stephanie Moteau","Stephane Tuffin"],"pdf_url":"https://arxiv.org/pdf/2308.02182v3.pdf","comment":"Paper accepted for publication in IEEE TNSM journal. Please cite that\n version"},{"id":"http://arxiv.org/abs/2303.13937v5","updated":"2023-10-13T06:43:44Z","published":"2023-03-24T11:50:08Z","title":"Topological Reconstruction of Particle Physics Processes using Graph\n Neural Networks","summary":" We present a new approach, the Topograph, which reconstructs underlying\nphysics processes, including the intermediary particles, by leveraging\nunderlying priors from the nature of particle physics decays and the\nflexibility of message passing graph neural networks. The Topograph not only\nsolves the combinatoric assignment of observed final state objects, associating\nthem to their original mother particles, but directly predicts the properties\nof intermediate particles in hard scatter processes and their subsequent\ndecays. In comparison to standard combinatoric approaches or modern approaches\nusing graph neural networks, which scale exponentially or quadratically, the\ncomplexity of Topographs scales linearly with the number of reconstructed\nobjects.\n We apply Topographs to top quark pair production in the all hadronic decay\nchannel, where we outperform the standard approach and match the performance of\nthe state-of-the-art machine learning technique.\n","authors":["Lukas Ehrke","John Andrew Raine","Knut Zoch","Manuel Guth","Tobias Golling"],"pdf_url":"https://arxiv.org/pdf/2303.13937v5.pdf","comment":"25 pages, 24 figures, 8 tables"},{"id":"http://arxiv.org/abs/2310.08887v1","updated":"2023-10-13T06:43:11Z","published":"2023-10-13T06:43:11Z","title":"METRA: Scalable Unsupervised RL with Metric-Aware Abstraction","summary":" Unsupervised pre-training strategies have proven to be highly effective in\nnatural language processing and computer vision. Likewise, unsupervised\nreinforcement learning (RL) holds the promise of discovering a variety of\npotentially useful behaviors that can accelerate the learning of a wide array\nof downstream tasks. Previous unsupervised RL approaches have mainly focused on\npure exploration and mutual information skill learning. 
However, despite the\nprevious attempts, making unsupervised RL truly scalable still remains a major\nopen challenge: pure exploration approaches might struggle in complex\nenvironments with large state spaces, where covering every possible transition\nis infeasible, and mutual information skill learning approaches might\ncompletely fail to explore the environment due to the lack of incentives. To\nmake unsupervised RL scalable to complex, high-dimensional environments, we\npropose a novel unsupervised RL objective, which we call Metric-Aware\nAbstraction (METRA). Our main idea is, instead of directly covering the entire\nstate space, to only cover a compact latent space $Z$ that is metrically\nconnected to the state space $S$ by temporal distances. By learning to move in\nevery direction in the latent space, METRA obtains a tractable set of diverse\nbehaviors that approximately cover the state space, being scalable to\nhigh-dimensional environments. Through our experiments in five locomotion and\nmanipulation environments, we demonstrate that METRA can discover a variety of\nuseful behaviors even in complex, pixel-based environments, being the first\nunsupervised RL method that discovers diverse locomotion behaviors in\npixel-based Quadruped and Humanoid. Our code and videos are available at\nhttps://seohong.me/projects/metra/\n","authors":["Seohong Park","Oleh Rybkin","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2310.08887v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17625v2","updated":"2023-10-13T06:36:06Z","published":"2023-05-28T04:08:40Z","title":"Cross-Domain Policy Adaptation via Value-Guided Data Filtering","summary":" Generalizing policies across different domains with dynamics mismatch poses a\nsignificant challenge in reinforcement learning. For example, a robot learns\nthe policy in a simulator, but when it is deployed in the real world, the\ndynamics of the environment may be different. Given the source and target\ndomain with dynamics mismatch, we consider the online dynamics adaptation\nproblem, in which case the agent can access sufficient source domain data while\nonline interactions with the target domain are limited. Existing research has\nattempted to solve the problem from the dynamics discrepancy perspective. In\nthis work, we reveal the limitations of these methods and explore the problem\nfrom the value difference perspective via a novel insight on the value\nconsistency across domains. Specifically, we present the Value-Guided Data\nFiltering (VGDF) algorithm, which selectively shares transitions from the\nsource domain based on the proximity of paired value targets across the two\ndomains. Empirical results on various environments with kinematic and\nmorphology shifts demonstrate that our method achieves superior performance\ncompared to prior approaches.\n","authors":["Kang Xu","Chenjia Bai","Xiaoteng Ma","Dong Wang","Bin Zhao","Zhen Wang","Xuelong Li","Wei Li"],"pdf_url":"https://arxiv.org/pdf/2305.17625v2.pdf","comment":"27 pages, 15 figures"},{"id":"http://arxiv.org/abs/2310.08876v1","updated":"2023-10-13T06:03:07Z","published":"2023-10-13T06:03:07Z","title":"Gesture Recognition for FMCW Radar on the Edge","summary":" This paper introduces a lightweight gesture recognition system based on 60\nGHz frequency modulated continuous wave (FMCW) radar. We show that gestures can\nbe characterized efficiently by a set of five features, and propose a slim\nradar processing algorithm to extract these features. 
In contrast to previous\napproaches, we avoid heavy 2D processing, i.e. range-Doppler imaging, and\nperform instead an early target detection - this allows us to port the system\nto fully embedded platforms with tight constraints on memory, compute and power\nconsumption. A recurrent neural network (RNN) based architecture exploits these\nfeatures to jointly detect and classify five different gestures. The proposed\nsystem recognizes gestures with an F1 score of 98.4% on our hold-out test\ndataset, it runs on an Arm Cortex-M4 microcontroller requiring less than 280 kB\nof flash memory, 120 kB of RAM, and consuming 75 mW of power.\n","authors":["Maximilian Strobel","Stephan Schoenfeldt","Jonas Daugalas"],"pdf_url":"https://arxiv.org/pdf/2310.08876v1.pdf","comment":"4 pages, 5 figures, submitted to 2024 IEEE Topical Conference on\n Wireless Sensors and Sensor Networks (WiSNeT)"},{"id":"http://arxiv.org/abs/2309.03084v3","updated":"2023-10-13T06:01:17Z","published":"2023-09-04T09:16:49Z","title":"Pure Monte Carlo Counterfactual Regret Minimization","summary":" Counterfactual Regret Minimization (CFR) and its variants are the best\nalgorithms so far for solving large-scale incomplete information games.\nHowever, we believe that there are two problems with CFR: First, matrix\nmultiplication is required in CFR iteration, and the time complexity of one\niteration is too high; Secondly, the game characteristics in the real world are\ndifferent. Just using one CFR algorithm will not be perfectly suitable for all\ngame problems.\n For these two problems, this paper proposes a new algorithm called Pure CFR\n(PCFR) based on CFR. PCFR can be seen as a combination of CFR and Fictitious\nPlay (FP), inheriting the concept of counterfactual regret (value) from CFR,\nand using the best response strategy instead of the regret matching strategy\nfor the next iteration. This algorithm has three advantages. First, PCFR can be\ncombined with any CFR variant. The resulting Pure MCCFR (PMCCFR) can\nsignificantly reduce the time and space complexity of one iteration. Secondly,\nour experiments show that the convergence speed of the PMCCFR is 2$\\sim$3 times\nthat of the MCCFR. Finally, there is a type of game that is very suitable for\nPCFR. We call this type of game clear-game, which is characterized by a high\nproportion of dominated strategies. Experiments show that in clear-game, the\nconvergence rate of PMCCFR is two orders of magnitude higher than that of\nMCCFR.\n","authors":["Ju Qi","Ting Feng","Falun Hei","Zhemei Fang","Yunfeng Luo"],"pdf_url":"https://arxiv.org/pdf/2309.03084v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03986v2","updated":"2023-10-13T05:35:40Z","published":"2023-10-06T03:04:21Z","title":"Robust Multimodal Learning with Missing Modalities via\n Parameter-Efficient Adaptation","summary":" Multimodal learning seeks to utilize data from multiple sources to improve\nthe overall performance of downstream tasks. It is desirable for redundancies\nin the data to make multimodal systems robust to missing or corrupted\nobservations in some correlated modalities. However, we observe that the\nperformance of several existing multimodal networks significantly deteriorates\nif one or multiple modalities are absent at test time. To enable robustness to\nmissing modalities, we propose simple and parameter-efficient adaptation\nprocedures for pretrained multimodal networks. 
In particular, we exploit\nlow-rank adaptation and modulation of intermediate features to compensate for\nthe missing modalities. We demonstrate that such adaptation can partially\nbridge the performance drop due to missing modalities and outperform independent,\ndedicated networks trained for the available modality combinations in some\ncases. The proposed adaptation requires an extremely small number of parameters\n(e.g., fewer than 0.7% of the total parameters in most experiments). We conduct\na series of experiments to highlight the robustness of our proposed method\nusing diverse datasets for RGB-thermal and RGB-Depth semantic segmentation,\nmultimodal material segmentation, and multimodal sentiment analysis tasks. Our\nproposed method demonstrates versatility across various tasks and datasets, and\noutperforms existing methods for robust multimodal learning with missing\nmodalities.\n","authors":["Md Kaykobad Reza","Ashley Prater-Bennette","M. Salman Asif"],"pdf_url":"https://arxiv.org/pdf/2310.03986v2.pdf","comment":"18 pages, 3 figures, 11 tables"},{"id":"http://arxiv.org/abs/2310.08867v1","updated":"2023-10-13T05:35:13Z","published":"2023-10-13T05:35:13Z","title":"A Survey of Methods for Handling Disk Data Imbalance","summary":" Class imbalance exists in many classification problems, and since the data is\ndesigned for accuracy, imbalance in data classes can lead to classification\nchallenges with a few classes having higher misclassification costs. The\nBackblaze dataset, a widely used dataset related to hard discs, has a small\namount of failure data and a large amount of health data, which exhibits a\nserious class imbalance. This paper provides a comprehensive overview of\nresearch in the field of imbalanced data classification. The discussion is\norganized into three main aspects: data-level methods, algorithmic-level\nmethods, and hybrid methods. For each type of method, we summarize and analyze\nthe existing problems, algorithmic ideas, strengths, and weaknesses.\nAdditionally, the challenges of unbalanced data classification are discussed,\nalong with strategies to address them. It is convenient for researchers to\nchoose the appropriate method according to their needs.\n","authors":["Shuangshuang Yuan","Peng Wu","Yuehui Chen","Qiang Li"],"pdf_url":"https://arxiv.org/pdf/2310.08867v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08866v1","updated":"2023-10-13T05:29:09Z","published":"2023-10-13T05:29:09Z","title":"Adaptivity and Modularity for Efficient Generalization Over Task\n Complexity","summary":" Can transformers generalize efficiently on problems that require dealing with\nexamples with different levels of difficulty? We introduce a new task tailored\nto assess generalization over different complexities and present results that\nindicate that standard transformers face challenges in solving these tasks.\nThese tasks are variations of pointer value retrieval previously introduced by\nZhang et al. (2021). We investigate how the use of a mechanism for adaptive and\nmodular computation in transformers facilitates the learning of tasks that\ndemand generalization over the number of sequential computation steps (i.e.,\nthe depth of the computation graph). 
Based on our observations, we propose a\ntransformer-based architecture called Hyper-UT, which combines dynamic function\ngeneration from hyper networks with adaptive depth from Universal Transformers.\nThis model demonstrates higher accuracy and a fairer allocation of\ncomputational resources when generalizing to higher numbers of computation\nsteps. We conclude that mechanisms for adaptive depth and modularity complement\neach other in improving efficient generalization concerning example complexity.\nAdditionally, to emphasize the broad applicability of our findings, we\nillustrate that in a standard image recognition task, Hyper- UT's performance\nmatches that of a ViT model but with considerably reduced computational demands\n(achieving over 70\\% average savings by effectively using fewer layers).\n","authors":["Samira Abnar","Omid Saremi","Laurent Dinh","Shantel Wilson","Miguel Angel Bautista","Chen Huang","Vimal Thilak","Etai Littwin","Jiatao Gu","Josh Susskind","Samy Bengio"],"pdf_url":"https://arxiv.org/pdf/2310.08866v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08863v1","updated":"2023-10-13T05:12:48Z","published":"2023-10-13T05:12:48Z","title":"In-Context Learning for Few-Shot Molecular Property Prediction","summary":" In-context learning has become an important approach for few-shot learning in\nLarge Language Models because of its ability to rapidly adapt to new tasks\nwithout fine-tuning model parameters. However, it is restricted to applications\nin natural language and inapplicable to other domains. In this paper, we adapt\nthe concepts underpinning in-context learning to develop a new algorithm for\nfew-shot molecular property prediction. Our approach learns to predict\nmolecular properties from a context of (molecule, property measurement) pairs\nand rapidly adapts to new properties without fine-tuning. On the FS-Mol and\nBACE molecular property prediction benchmarks, we find this method surpasses\nthe performance of recent meta-learning algorithms at small support sizes and\nis competitive with the best methods at large support sizes.\n","authors":["Christopher Fifty","Jure Leskovec","Sebastian Thrun"],"pdf_url":"https://arxiv.org/pdf/2310.08863v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08858v1","updated":"2023-10-13T04:59:44Z","published":"2023-10-13T04:59:44Z","title":"Adam-family Methods with Decoupled Weight Decay in Deep Learning","summary":" In this paper, we investigate the convergence properties of a wide class of\nAdam-family methods for minimizing quadratically regularized nonsmooth\nnonconvex optimization problems, especially in the context of training\nnonsmooth neural networks with weight decay. Motivated by the AdamW method, we\npropose a novel framework for Adam-family methods with decoupled weight decay.\nWithin our framework, the estimators for the first-order and second-order\nmoments of stochastic subgradients are updated independently of the weight\ndecay term. Under mild assumptions and with non-diminishing stepsizes for\nupdating the primary optimization variables, we establish the convergence\nproperties of our proposed framework. In addition, we show that our proposed\nframework encompasses a wide variety of well-known Adam-family methods, hence\noffering convergence guarantees for these methods in the training of nonsmooth\nneural networks. 
More importantly, we show that our proposed framework\nasymptotically approximates the SGD method, thereby providing an explanation\nfor the empirical observation that decoupled weight decay enhances\ngeneralization performance for Adam-family methods. As a practical application\nof our proposed framework, we propose a novel Adam-family method named Adam\nwith Decoupled Weight Decay (AdamD), and establish its convergence properties\nunder mild conditions. Numerical experiments demonstrate that AdamD outperforms\nAdam and is comparable to AdamW, in the aspects of both generalization\nperformance and efficiency.\n","authors":["Kuangyu Ding","Nachuan Xiao","Kim-Chuan Toh"],"pdf_url":"https://arxiv.org/pdf/2310.08858v1.pdf","comment":"26 pages"},{"id":"http://arxiv.org/abs/2310.08855v1","updated":"2023-10-13T04:50:40Z","published":"2023-10-13T04:50:40Z","title":"Overcoming Recency Bias of Normalization Statistics in Continual\n Learning: Balance and Adaptation","summary":" Continual learning entails learning a sequence of tasks and balancing their\nknowledge appropriately. With limited access to old training samples, much of\nthe current work in deep neural networks has focused on overcoming catastrophic\nforgetting of old tasks in gradient-based optimization. However, the\nnormalization layers provide an exception, as they are updated interdependently\nby the gradient and statistics of currently observed training samples, which\nrequire specialized strategies to mitigate recency bias. In this work, we focus\non the most popular Batch Normalization (BN) and provide an in-depth\ntheoretical analysis of its sub-optimality in continual learning. Our analysis\ndemonstrates the dilemma between balance and adaptation of BN statistics for\nincremental tasks, which potentially affects training stability and\ngeneralization. Targeting on these particular challenges, we propose Adaptive\nBalance of BN (AdaB$^2$N), which incorporates appropriately a Bayesian-based\nstrategy to adapt task-wise contributions and a modified momentum to balance BN\nstatistics, corresponding to the training and testing stages. By implementing\nBN in a continual learning fashion, our approach achieves significant\nperformance gains across a wide range of benchmarks, particularly for the\nchallenging yet realistic online scenarios (e.g., up to 7.68%, 6.86% and 4.26%\non Split CIFAR-10, Split CIFAR-100 and Split Mini-ImageNet, respectively). Our\ncode is available at https://github.com/lvyilin/AdaB2N.\n","authors":["Yilin Lyu","Liyuan Wang","Xingxing Zhang","Zicheng Sun","Hang Su","Jun Zhu","Liping Jing"],"pdf_url":"https://arxiv.org/pdf/2310.08855v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.17439v2","updated":"2023-10-13T04:50:04Z","published":"2023-06-30T07:24:32Z","title":"Provable Robust Watermarking for AI-Generated Text","summary":" We study the problem of watermarking large language models (LLMs) generated\ntext -- one of the most promising approaches for addressing the safety\nchallenges of LLM usage. In this paper, we propose a rigorous theoretical\nframework to quantify the effectiveness and robustness of LLM watermarks. We\npropose a robust and high-quality watermark method, Unigram-Watermark, by\nextending an existing approach with a simplified fixed grouping strategy. We\nprove that our watermark method enjoys guaranteed generation quality,\ncorrectness in watermark detection, and is robust against text editing and\nparaphrasing. 
Experiments on three varying LLMs and two datasets verify that\nour Unigram-Watermark achieves superior detection accuracy and comparable\ngeneration quality in perplexity, thus promoting the responsible use of LLMs.\nCode is available at https://github.com/XuandongZhao/Unigram-Watermark.\n","authors":["Xuandong Zhao","Prabhanjan Ananth","Lei Li","Yu-Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2306.17439v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08854v1","updated":"2023-10-13T04:48:32Z","published":"2023-10-13T04:48:32Z","title":"Rank-DETR for High Quality Object Detection","summary":" Modern detection transformers (DETRs) use a set of object queries to predict\na list of bounding boxes, sort them by their classification confidence scores,\nand select the top-ranked predictions as the final detection results for the\ngiven input image. A highly performant object detector requires accurate\nranking for the bounding box predictions. For DETR-based detectors, the\ntop-ranked bounding boxes suffer from less accurate localization quality due to\nthe misalignment between classification scores and localization accuracy, thus\nimpeding the construction of high-quality detectors. In this work, we introduce\na simple and highly performant DETR-based object detector by proposing a series\nof rank-oriented designs, combinedly called Rank-DETR. Our key contributions\ninclude: (i) a rank-oriented architecture design that can prompt positive\npredictions and suppress the negative ones to ensure lower false positive\nrates, as well as (ii) a rank-oriented loss function and matching cost design\nthat prioritizes predictions of more accurate localization accuracy during\nranking to boost the AP under high IoU thresholds. We apply our method to\nimprove the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong\nCOCO object detection results when using different backbones such as\nResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our\napproach. Code is available at \\url{https://github.com/LeapLabTHU/Rank-DETR}.\n","authors":["Yifan Pu","Weicong Liang","Yiduo Hao","Yuhui Yuan","Yukang Yang","Chao Zhang","Han Hu","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08854v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08848v1","updated":"2023-10-13T04:22:21Z","published":"2023-10-13T04:22:21Z","title":"Semi-Supervised End-To-End Contrastive Learning For Time Series\n Classification","summary":" Time series classification is a critical task in various domains, such as\nfinance, healthcare, and sensor data analysis. Unsupervised contrastive\nlearning has garnered significant interest in learning effective\nrepresentations from time series data with limited labels. The prevalent\napproach in existing contrastive learning methods consists of two separate\nstages: pre-training the encoder on unlabeled datasets and fine-tuning the\nwell-trained model on a small-scale labeled dataset. However, such two-stage\napproaches suffer from several shortcomings, such as the inability of\nunsupervised pre-training contrastive loss to directly affect downstream\nfine-tuning classifiers, and the lack of exploiting the classification loss\nwhich is guided by valuable ground truth. In this paper, we propose an\nend-to-end model called SLOTS (Semi-supervised Learning fOr Time\nclasSification). 
SLOTS receives semi-labeled datasets, comprising a large\nnumber of unlabeled samples and a small proportion of labeled samples, and maps\nthem to an embedding space through an encoder. We not only calculate the\nunsupervised contrastive loss but also measure the supervised contrastive loss\non the samples with ground truth. The learned embeddings are fed into a\nclassifier, and the classification loss is calculated using the available true\nlabels. The unsupervised, supervised contrastive losses and classification loss\nare jointly used to optimize the encoder and classifier. We evaluate SLOTS by\ncomparing it with ten state-of-the-art methods across five datasets. The\nresults demonstrate that SLOTS is a simple yet effective framework. When\ncompared to the two-stage framework, our end-to-end SLOTS utilizes the same\ninput data, consumes a similar computational cost, but delivers significantly\nimproved performance. We release code and datasets at\nhttps://anonymous.4open.science/r/SLOTS-242E.\n","authors":["Huili Cai","Xiang Zhang","Xiaofeng Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08848v1.pdf","comment":"Submitted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.08847v1","updated":"2023-10-13T04:14:51Z","published":"2023-10-13T04:14:51Z","title":"On the Over-Memorization During Natural, Robust and Catastrophic\n Overfitting","summary":" Overfitting negatively impacts the generalization ability of deep neural\nnetworks (DNNs) in both natural and adversarial training. Existing methods\nstruggle to consistently address different types of overfitting, typically\ndesigning strategies that focus separately on either natural or adversarial\npatterns. In this work, we adopt a unified perspective by solely focusing on\nnatural patterns to explore different types of overfitting. Specifically, we\nexamine the memorization effect in DNNs and reveal a shared behaviour termed\nover-memorization, which impairs their generalization capacity. This behaviour\nmanifests as DNNs suddenly becoming high-confidence in predicting certain\ntraining patterns and retaining a persistent memory for them. Furthermore, when\nDNNs over-memorize an adversarial pattern, they tend to simultaneously exhibit\nhigh-confidence prediction for the corresponding natural pattern. These\nfindings motivate us to holistically mitigate different types of overfitting by\nhindering the DNNs from over-memorizing natural patterns. To this end, we\npropose a general framework, Distraction Over-Memorization (DOM), which\nexplicitly prevents over-memorization by either removing or augmenting the\nhigh-confidence natural patterns. Extensive experiments demonstrate the\neffectiveness of our proposed method in mitigating overfitting across various\ntraining paradigms.\n","authors":["Runqi Lin","Chaojian Yu","Bo Han","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08847v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10577v3","updated":"2023-10-13T04:05:07Z","published":"2023-06-18T14:38:29Z","title":"OpenDataVal: a Unified Benchmark for Data Valuation","summary":" Assessing the quality and impact of individual data points is critical for\nimproving model performance and mitigating undesirable biases within the\ntraining dataset. Several data valuation algorithms have been proposed to\nquantify data quality; however, there is no systematic and standardized\nbenchmarking system for data valuation. 
In this paper, we introduce\nOpenDataVal, an easy-to-use and unified benchmark framework that empowers\nresearchers and practitioners to apply and compare various data valuation\nalgorithms. OpenDataVal provides an integrated environment that includes (i) a\ndiverse collection of image, natural language, and tabular datasets, (ii)\nimplementations of eleven different state-of-the-art data valuation algorithms,\nand (iii) a prediction model API that can import any models in scikit-learn.\nFurthermore, we propose four downstream machine learning tasks for evaluating\nthe quality of data values. We perform benchmarking analysis using OpenDataVal,\nquantifying and comparing the efficacy of state-of-the-art data valuation\napproaches. We find that no single algorithm performs uniformly best across all\ntasks, and an appropriate algorithm should be employed for a user's downstream\ntask. OpenDataVal is publicly available at https://opendataval.github.io with\ncomprehensive documentation. Furthermore, we provide a leaderboard where\nresearchers can evaluate the effectiveness of their own data valuation\nalgorithms.\n","authors":["Kevin Fu Jiang","Weixin Liang","James Zou","Yongchan Kwon"],"pdf_url":"https://arxiv.org/pdf/2306.10577v3.pdf","comment":"25 pages, NeurIPS 2023 Track on Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2206.12403v2","updated":"2023-10-13T03:48:11Z","published":"2022-06-24T17:59:02Z","title":"ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings","summary":" We present a scalable approach for learning open-world object-goal navigation\n(ObjectNav) -- the task of asking a virtual robot (agent) to find any instance\nof an object in an unexplored environment (e.g., \"find a sink\"). Our approach\nis entirely zero-shot -- i.e., it does not require ObjectNav rewards or\ndemonstrations of any kind. Instead, we train on the image-goal navigation\n(ImageNav) task, in which agents find the location where a picture (i.e., goal\nimage) was captured. Specifically, we encode goal images into a multimodal,\nsemantic embedding space to enable training semantic-goal navigation\n(SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D).\nAfter training, SemanticNav agents can be instructed to find objects described\nin free-form natural language (e.g., \"sink\", \"bathroom sink\", etc.) by\nprojecting language goals into the same multimodal, semantic embedding space.\nAs a result, our approach enables open-world ObjectNav. We extensively evaluate\nour agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe\nabsolute improvements in success of 4.2% - 20.0% over existing zero-shot\nmethods. For reference, these gains are similar or better than the 5%\nimprovement in success between the Habitat 2020 and 2021 ObjectNav challenge\nwinners. 
In an open-world setting, we discover that our agents can generalize\nto compound instructions with a room explicitly mentioned (e.g., \"Find a\nkitchen sink\") and when the target room can be inferred (e.g., \"Find a sink and\na stove\").\n","authors":["Arjun Majumdar","Gunjan Aggarwal","Bhavika Devnani","Judy Hoffman","Dhruv Batra"],"pdf_url":"https://arxiv.org/pdf/2206.12403v2.pdf","comment":"code: https://github.com/gunagg/zson"},{"id":"http://arxiv.org/abs/2310.04457v2","updated":"2023-10-13T03:42:43Z","published":"2023-10-04T22:23:40Z","title":"ProGO: Probabilistic Global Optimizer","summary":" In the field of global optimization, many existing algorithms face challenges\nposed by non-convex target functions and high computational complexity or\nunavailability of gradient information. These limitations, exacerbated by\nsensitivity to initial conditions, often lead to suboptimal solutions or failed\nconvergence. This is true even for Metaheuristic algorithms designed to\namalgamate different optimization techniques to improve their efficiency and\nrobustness. To address these challenges, we develop a sequence of\nmultidimensional integration-based methods that we show to converge to the\nglobal optima under some mild regularity conditions. Our probabilistic approach\ndoes not require the use of gradients and is underpinned by a mathematically\nrigorous convergence framework anchored in the nuanced properties of the nascent\noptima distribution. In order to alleviate the problem of multidimensional\nintegration, we develop a latent slice sampler that enjoys a geometric rate of\nconvergence in generating samples from the nascent optima distribution, which\nis used to approximate the global optima. The proposed Probabilistic Global\nOptimizer (ProGO) provides a scalable unified framework to approximate the\nglobal optima of any continuous function defined on a domain of arbitrary\ndimension. Empirical illustrations of ProGO across a variety of popular\nnon-convex test functions (having finite global optima) reveal that the\nproposed algorithm outperforms, by orders of magnitude, many existing\nstate-of-the-art methods, including gradient-based, zeroth-order gradient-free,\nand some Bayesian Optimization methods, in terms of regret value and speed of\nconvergence. It is, however, to be noted that our approach may not be suitable\nfor functions that are expensive to compute.\n","authors":["Xinyu Zhang","Sujit Ghosh"],"pdf_url":"https://arxiv.org/pdf/2310.04457v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15395v3","updated":"2023-10-13T03:20:33Z","published":"2023-09-27T04:33:09Z","title":"Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs","summary":" This paper considers the best policy identification (BPI) problem in online\nConstrained Markov Decision Processes (CMDPs). We are interested in algorithms\nthat are model-free, have low regret, and identify an optimal policy with a\nhigh probability. Existing model-free algorithms for online CMDPs with\nsublinear regret and constraint violation do not provide any convergence\nguarantee to an optimal policy and provide only average performance guarantees\nwhen a policy is uniformly sampled at random from all previously used policies.\nIn this paper, we develop a new algorithm, named\nPruning-Refinement-Identification (PRI), based on a fundamental structural\nproperty of CMDPs proved in Koole(1988); Ross(1989), which we call limited\nstochasticity. 
The property says for a CMDP with $N$ constraints, there exists\nan optimal policy with at most $N$ stochastic decisions.\n The proposed algorithm first identifies at which step and in which state a\nstochastic decision has to be taken and then fine-tunes the distributions of\nthese stochastic decisions. PRI achieves trio objectives: (i) PRI is a\nmodel-free algorithm; and (ii) it outputs a near-optimal policy with a high\nprobability at the end of learning; and (iii) in the tabular setting, PRI\nguarantees $\\tilde{\\mathcal{O}}(\\sqrt{K})$ regret and constraint violation,\nwhich significantly improves the best existing regret bound\n$\\tilde{\\mathcal{O}}(K^{\\frac{4}{5}})$ under a model-free algorithm, where $K$\nis the total number of episodes.\n","authors":["Zihan Zhou","Honghao Wei","Lei Ying"],"pdf_url":"https://arxiv.org/pdf/2309.15395v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08836v1","updated":"2023-10-13T03:15:42Z","published":"2023-10-13T03:15:42Z","title":"A Framework for Few-Shot Policy Transfer through Observation Mapping and\n Behavior Cloning","summary":" Despite recent progress in Reinforcement Learning for robotics applications,\nmany tasks remain prohibitively difficult to solve because of the expensive\ninteraction cost. Transfer learning helps reduce the training time in the\ntarget domain by transferring knowledge learned in a source domain. Sim2Real\ntransfer helps transfer knowledge from a simulated robotic domain to a physical\ntarget domain. Knowledge transfer reduces the time required to train a task in\nthe physical world, where the cost of interactions is high. However, most\nexisting approaches assume exact correspondence in the task structure and the\nphysical properties of the two domains. This work proposes a framework for\nFew-Shot Policy Transfer between two domains through Observation Mapping and\nBehavior Cloning. We use Generative Adversarial Networks (GANs) along with a\ncycle-consistency loss to map the observations between the source and target\ndomains and later use this learned mapping to clone the successful source task\nbehavior policy to the target domain. We observe successful behavior policy\ntransfer with limited target task interactions and in cases where the source\nand target task are semantically dissimilar.\n","authors":["Yash Shukla","Bharat Kesari","Shivam Goel","Robert Wright","Jivko Sinapov"],"pdf_url":"https://arxiv.org/pdf/2310.08836v1.pdf","comment":"Paper accepted to the IROS 2023 Conference"},{"id":"http://arxiv.org/abs/2310.01404v2","updated":"2023-10-13T03:14:16Z","published":"2023-10-02T17:59:03Z","title":"H-InDex: Visual Reinforcement Learning with Hand-Informed\n Representations for Dexterous Manipulation","summary":" Human hands possess remarkable dexterity and have long served as a source of\ninspiration for robotic manipulation. In this work, we propose a human\n$\\textbf{H}$and$\\textbf{-In}$formed visual representation learning framework to\nsolve difficult $\\textbf{Dex}$terous manipulation tasks ($\\textbf{H-InDex}$)\nwith reinforcement learning. Our framework consists of three stages: (i)\npre-training representations with 3D human hand pose estimation, (ii) offline\nadapting representations with self-supervised keypoint detection, and (iii)\nreinforcement learning with exponential moving average BatchNorm. The last two\nstages only modify $0.36\\%$ parameters of the pre-trained representation in\ntotal, ensuring the knowledge from pre-training is maintained to the full\nextent. 
We empirically study 12 challenging dexterous manipulation tasks and\nfind that H-InDex largely surpasses strong baseline methods and the recent\nvisual foundation models for motor control. Code is available at\nhttps://yanjieze.com/H-InDex .\n","authors":["Yanjie Ze","Yuyao Liu","Ruizhe Shi","Jiaxin Qin","Zhecheng Yuan","Jiashun Wang","Huazhe Xu"],"pdf_url":"https://arxiv.org/pdf/2310.01404v2.pdf","comment":"NeurIPS 2023. Code and videos: https://yanjieze.com/H-InDex"},{"id":"http://arxiv.org/abs/2310.08833v1","updated":"2023-10-13T03:08:59Z","published":"2023-10-13T03:08:59Z","title":"Optimal Sample Complexity for Average Reward Markov Decision Processes","summary":" We settle the sample complexity of policy learning for the maximization of\nthe long run average reward associated with a uniformly ergodic Markov decision\nprocess (MDP), assuming a generative model. In this context, the existing\nliterature provides a sample complexity upper bound of $\\widetilde\nO(|S||A|t_{\\text{mix}}^2 \\epsilon^{-2})$ and a lower bound of\n$\\Omega(|S||A|t_{\\text{mix}} \\epsilon^{-2})$. In these expressions, $|S|$ and\n$|A|$ denote the cardinalities of the state and action spaces respectively,\n$t_{\\text{mix}}$ serves as a uniform upper limit for the total variation mixing\ntimes, and $\\epsilon$ signifies the error tolerance. Therefore, a notable gap\nof $t_{\\text{mix}}$ still remains to be bridged. Our primary contribution is to\nestablish an estimator for the optimal policy of average reward MDPs with a\nsample complexity of $\\widetilde O(|S||A|t_{\\text{mix}}\\epsilon^{-2})$,\neffectively reaching the lower bound in the literature. This is achieved by\ncombining algorithmic ideas in Jin and Sidford (2021) with those of Li et al.\n(2020).\n","authors":["Shengbo Wang","Jose Blanchet","Peter Glynn"],"pdf_url":"https://arxiv.org/pdf/2310.08833v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07805v2","updated":"2023-10-13T03:06:00Z","published":"2023-10-11T18:38:28Z","title":"Generative Modeling with Phase Stochastic Bridges","summary":" Diffusion models (DMs) represent state-of-the-art generative models for\ncontinuous inputs. DMs work by constructing a Stochastic Differential Equation\n(SDE) in the input space (ie, position space), and using a neural network to\nreverse it. In this work, we introduce a novel generative modeling framework\ngrounded in \\textbf{phase space dynamics}, where a phase space is defined as\n{an augmented space encompassing both position and velocity.} Leveraging\ninsights from Stochastic Optimal Control, we construct a path measure in the\nphase space that enables efficient sampling. {In contrast to DMs, our framework\ndemonstrates the capability to generate realistic data points at an early stage\nof dynamics propagation.} This early prediction sets the stage for efficient\ndata generation by leveraging additional velocity information along the\ntrajectory. On standard image generation benchmarks, our model yields favorable\nperformance over baselines in the regime of small Number of Function\nEvaluations (NFEs). Furthermore, our approach rivals the performance of\ndiffusion models equipped with efficient sampling techniques, underscoring its\npotential as a new tool generative modeling.\n","authors":["Tianrong Chen","Jiatao Gu","Laurent Dinh","Evangelos A. 
Theodorou","Josh Susskind","Shuangfei Zhai"],"pdf_url":"https://arxiv.org/pdf/2310.07805v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07958v2","updated":"2023-10-13T02:50:56Z","published":"2023-10-12T00:51:06Z","title":"Towards Causal Deep Learning for Vulnerability Detection","summary":" Deep learning vulnerability detection has shown promising results in recent\nyears. However, an important challenge that still blocks it from being very\nuseful in practice is that the model is not robust under perturbation and it\ncannot generalize well over the out-of-distribution (OOD) data, e.g., applying\na trained model to unseen projects in real world. We hypothesize that this is\nbecause the model learned non-robust features, e.g., variable names, that have\nspurious correlations with labels. When the perturbed and OOD datasets no\nlonger have the same spurious features, the model prediction fails. To address\nthe challenge, in this paper, we introduced causality into deep learning\nvulnerability detection. Our approach CausalVul consists of two phases. First,\nwe designed novel perturbations to discover spurious features that the model\nmay use to make predictions. Second, we applied the causal learning algorithms,\nspecifically, do-calculus, on top of existing deep learning models to\nsystematically remove the use of spurious features and thus promote causal\nbased prediction. Our results show that CausalVul consistently improved the\nmodel accuracy, robustness and OOD performance for all the state-of-the-art\nmodels and datasets we experimented. To the best of our knowledge, this is the\nfirst work that introduces do calculus based causal learning to software\nengineering models and shows it's indeed useful for improving the model\naccuracy, robustness and generalization. Our replication package is located at\nhttps://figshare.com/s/0ffda320dcb96c249ef2.\n","authors":["Md Mahbubur Rahman","Ira Ceka","Chengzhi Mao","Saikat Chakraborty","Baishakhi Ray","Wei Le"],"pdf_url":"https://arxiv.org/pdf/2310.07958v2.pdf","comment":"Accepted at ICSE 2024 (not camera-ready version)"},{"id":"http://arxiv.org/abs/2310.08823v1","updated":"2023-10-13T02:38:35Z","published":"2023-10-13T02:38:35Z","title":"Distance-rank Aware Sequential Reward Learning for Inverse Reinforcement\n Learning with Sub-optimal Demonstrations","summary":" Inverse reinforcement learning (IRL) aims to explicitly infer an underlying\nreward function based on collected expert demonstrations. Considering that\nobtaining expert demonstrations can be costly, the focus of current IRL\ntechniques is on learning a better-than-demonstrator policy using a reward\nfunction derived from sub-optimal demonstrations. However, existing IRL\nalgorithms primarily tackle the challenge of trajectory ranking ambiguity when\nlearning the reward function. They overlook the crucial role of considering the\ndegree of difference between trajectories in terms of their returns, which is\nessential for further removing reward ambiguity. Additionally, it is important\nto note that the reward of a single transition is heavily influenced by the\ncontext information within the trajectory. To address these issues, we\nintroduce the Distance-rank Aware Sequential Reward Learning (DRASRL)\nframework. Unlike existing approaches, DRASRL takes into account both the\nranking of trajectories and the degrees of dissimilarity between them to\ncollaboratively eliminate reward ambiguity when learning a sequence of\ncontextually informed reward signals. 
Specifically, we leverage the distance\nbetween policies, from which the trajectories are generated, as a measure to\nquantify the degree of differences between traces. This distance-aware\ninformation is then used to infer embeddings in the representation space for\nreward learning, employing the contrastive learning technique. Meanwhile, we\nintegrate the pairwise ranking loss function to incorporate ranking information\ninto the latent features. Moreover, we resort to the Transformer architecture\nto capture the contextual dependencies within the trajectories in the latent\nspace, leading to more accurate reward estimation. Through extensive\nexperimentation, our DRASRL framework demonstrates significant performance\nimprovements over previous SOTA methods.\n","authors":["Lu Li","Yuxin Pan","Ruobing Chen","Jie Liu","Zilin Wang","Yu Liu","Zhiheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.08823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.07418v2","updated":"2023-10-13T02:27:42Z","published":"2023-08-14T19:12:40Z","title":"Locally Adaptive and Differentiable Regression","summary":" Over-parameterized models like deep nets and random forests have become very\npopular in machine learning. However, the natural goals of continuity and\ndifferentiability, common in regression models, are now often ignored in modern\noverparametrized, locally-adaptive models. We propose a general framework to\nconstruct a global continuous and differentiable model based on a weighted\naverage of locally learned models in corresponding local regions. This model is\ncompetitive in dealing with data with different densities or scales of function\nvalues in different local regions. We demonstrate that when we mix kernel ridge\nand polynomial regression terms in the local models, and stitch them together\ncontinuously, we achieve faster statistical convergence in theory and improved\nperformance in various practical settings.\n","authors":["Mingxuan Han","Varun Shankar","Jeff M Phillips","Chenglong Ye"],"pdf_url":"https://arxiv.org/pdf/2308.07418v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08817v1","updated":"2023-10-13T02:06:52Z","published":"2023-10-13T02:06:52Z","title":"Exploring the relationship between response time sequence in scale\n answering process and severity of insomnia: a machine learning approach","summary":" Objectives: The study aims to investigate the relationship between insomnia\nand response time. Additionally, it aims to develop a machine learning model to\npredict the presence of insomnia in participants using response time data.\nMethods: A mobile application was designed to administer scale tests and\ncollect response time data from 2729 participants. The relationship between\nsymptom severity and response time was explored, and a machine learning model\nwas developed to predict the presence of insomnia. Results: The result revealed\na statistically significant difference (p<.001) in the total response time\nbetween participants with or without insomnia symptoms. A correlation was\nobserved between the severity of specific insomnia aspects and response times\nat the individual questions level. The machine learning model demonstrated a\nhigh predictive accuracy of 0.743 in predicting insomnia symptoms based on\nresponse time data. 
Conclusions: These findings highlight the potential utility\nof response time data to evaluate cognitive and psychological measures,\ndemonstrating the effectiveness of using response time as a diagnostic tool in\nthe assessment of insomnia.\n","authors":["Zhao Su","Rongxun Liu","Keyin Zhou","Xinru Wei","Ning Wang","Zexin Lin","Yuanchen Xie","Jie Wang","Fei Wang","Shenzhong Zhang","Xizhe Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08812v1","updated":"2023-10-13T01:50:43Z","published":"2023-10-13T01:50:43Z","title":"A Nonlinear Method for time series forecasting using VMD-GARCH-LSTM\n model","summary":" Time series forecasting represents a significant and challenging task across\nvarious fields. Recently, methods based on mode decomposition have dominated\nthe forecasting of complex time series because of the advantages of capturing\nlocal characteristics and extracting intrinsic modes from data. Unfortunately,\nmost models fail to capture the implied volatilities that contain significant\ninformation. To enhance the forecasting of current, rapidly evolving, and\nvolatile time series, we propose a novel decomposition-ensemble paradigm, the\nVMD-LSTM-GARCH model. The Variational Mode Decomposition algorithm is employed\nto decompose the time series into K sub-modes. Subsequently, the GARCH model\nextracts the volatility information from these sub-modes, which serve as the\ninput for the LSTM. The numerical and volatility information of each sub-mode\nis utilized to train a Long Short-Term Memory network. This network predicts\nthe sub-mode, and then we aggregate the predictions from all sub-modes to\nproduce the output. By integrating econometric and artificial intelligence\nmethods, and taking into account both the numerical and volatility information\nof the time series, our proposed model demonstrates superior performance in\ntime series forecasting, as evidenced by the significant decrease in MSE, RMSE,\nand MAPE in our comparative experimental results.\n","authors":["Zhengtao Gui","Haoyuan Li","Sijie Xu","Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2310.08812v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.00521v2","updated":"2023-10-13T01:47:48Z","published":"2023-01-02T04:19:56Z","title":"A Policy Optimization Method Towards Optimal-time Stability","summary":" In current model-free reinforcement learning (RL) algorithms, stability\ncriteria based on sampling methods are commonly utilized to guide policy\noptimization. However, these criteria only guarantee the infinite-time\nconvergence of the system's state to an equilibrium point, which leads to\nsub-optimality of the policy. In this paper, we propose a policy optimization\ntechnique incorporating sampling-based Lyapunov stability. Our approach enables\nthe system's state to reach an equilibrium point within an optimal time and\nmaintain stability thereafter, referred to as \"optimal-time stability\". To\nachieve this, we integrate the optimization method into the Actor-Critic\nframework, resulting in the development of the Adaptive Lyapunov-based\nActor-Critic (ALAC) algorithm. Through evaluations conducted on ten robotic\ntasks, our approach outperforms previous studies significantly, effectively\nguiding the system to generate stable patterns.\n","authors":["Shengjie Wang","Fengbo Lan","Xiang Zheng","Yuxue Cao","Oluwatosin Oseni","Haotian Xu","Tao Zhang","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2301.00521v2.pdf","comment":"27 pages, 11 figures. 
7th Annual Conference on Robot Learning. 2023"},{"id":"http://arxiv.org/abs/2310.08800v1","updated":"2023-10-13T01:18:41Z","published":"2023-10-13T01:18:41Z","title":"DDMT: Denoising Diffusion Mask Transformer Models for Multivariate Time\n Series Anomaly Detection","summary":" Anomaly detection in multivariate time series has emerged as a crucial\nchallenge in time series research, with significant research implications in\nvarious fields such as fraud detection, fault diagnosis, and system state\nestimation. Reconstruction-based models have shown promising potential in\nrecent years for detecting anomalies in time series data. However, due to the\nrapid increase in data scale and dimensionality, the issues of noise and Weak\nIdentity Mapping (WIM) during time series reconstruction have become\nincreasingly pronounced. To address this, we introduce a novel Adaptive Dynamic\nNeighbor Mask (ADNM) mechanism and integrate it with the Transformer and\nDenoising Diffusion Model, creating a new framework for multivariate time\nseries anomaly detection, named Denoising Diffusion Mask Transformer (DDMT).\nThe ADNM module is introduced to mitigate information leakage between input and\noutput features during data reconstruction, thereby alleviating the problem of\nWIM during reconstruction. The Denoising Diffusion Transformer (DDT) employs\nthe Transformer as an internal neural network structure for Denoising Diffusion\nModel. It learns the stepwise generation process of time series data to model\nthe probability distribution of the data, capturing normal data patterns and\nprogressively restoring time series data by removing noise, resulting in a\nclear recovery of anomalies. To the best of our knowledge, this is the first\nmodel that combines Denoising Diffusion Model and the Transformer for\nmultivariate time series anomaly detection. Experimental evaluations were\nconducted on five publicly available multivariate time series anomaly detection\ndatasets. The results demonstrate that the model effectively identifies\nanomalies in time series data, achieving state-of-the-art performance in\nanomaly detection.\n","authors":["Chaocheng Yang","Tingyin Wang","Xuanhui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.08800v1.pdf","comment":"16 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.08475v2","updated":"2023-10-13T01:12:25Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights. 
Code and\ndataset are available in https://github.com/zjunlp/EasyEdit.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08795v1","updated":"2023-10-13T00:49:09Z","published":"2023-10-13T00:49:09Z","title":"Mitigating Bias for Question Answering Models by Tracking Bias Influence","summary":" Models of various NLP tasks have been shown to exhibit stereotypes, and the\nbias in the question answering (QA) models is especially harmful as the output\nanswers might be directly consumed by the end users. There have been datasets\nto evaluate bias in QA models, while bias mitigation technique for the QA\nmodels is still under-explored. In this work, we propose BMBI, an approach to\nmitigate the bias of multiple-choice QA models. Based on the intuition that a\nmodel would lean to be more biased if it learns from a biased example, we\nmeasure the bias level of a query instance by observing its influence on\nanother instance. If the influenced instance is more biased, we derive that the\nquery instance is biased. We then use the bias level detected as an\noptimization objective to form a multi-task learning setting in addition to the\noriginal QA task. We further introduce a new bias evaluation metric to quantify\nbias in a comprehensive and sensitive way. We show that our method could be\napplied to multiple QA formulations across multiple bias categories. It can\nsignificantly reduce the bias level in all 9 bias categories in the BBQ dataset\nwhile maintaining comparable QA accuracy.\n","authors":["Mingyu Derek Ma","Jiun-Yu Kao","Arpit Gupta","Yu-Hsiang Lin","Wenbo Zhao","Tagyoung Chung","Wei Wang","Kai-Wei Chang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2310.08795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08793v1","updated":"2023-10-13T00:46:12Z","published":"2023-10-13T00:46:12Z","title":"Analysis of Weather and Time Features in Machine Learning-aided ERCOT\n Load Forecasting","summary":" Accurate load forecasting is critical for efficient and reliable operations\nof the electric power system. A large part of electricity consumption is\naffected by weather conditions, making weather information an important\ndeterminant of electricity usage. Personal appliances and industry equipment\nalso contribute significantly to electricity demand with temporal patterns,\nmaking time a useful factor to consider in load forecasting. This work develops\nseveral machine learning (ML) models that take various time and weather\ninformation as part of the input features to predict the short-term system-wide\ntotal load. Ablation studies were also performed to investigate and compare the\nimpacts of different weather factors on the prediction accuracy. Actual load\nand historical weather data for the same region were processed and then used to\ntrain the ML models. It is interesting to observe that using all available\nfeatures, each of which may be correlated to the load, is unlikely to achieve\nthe best forecasting performance; features with redundancy may even decrease\nthe inference capabilities of ML models. This indicates the importance of\nfeature selection for ML models. 
Overall, case studies demonstrated the\neffectiveness of ML models trained with different weather and time input\nfeatures for ERCOT load forecasting.\n","authors":["Jonathan Yang","Mingjian Tuo","Jin Lu","Xingpeng Li"],"pdf_url":"https://arxiv.org/pdf/2310.08793v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07171v3","updated":"2023-10-13T00:35:33Z","published":"2023-10-11T03:39:56Z","title":"Federated Generalization via Information-Theoretic Distribution\n Diversification","summary":" Federated Learning (FL) has surged in prominence due to its capability of\ncollaborative model training without direct data sharing. However, the vast\ndisparity in local data distributions among clients, often termed the\nnon-Independent Identically Distributed (non-IID) challenge, poses a\nsignificant hurdle to FL's generalization efficacy. The scenario becomes even\nmore complex when not all clients participate in the training process, a common\noccurrence due to unstable network connections or limited computational\ncapacities. This can greatly complicate the assessment of the trained models'\ngeneralization abilities. While a plethora of recent studies has centered on\nthe generalization gap pertaining to unseen data from participating clients\nwith diverse distributions, the divergence between the training distributions\nof participating clients and the testing distributions of non-participating\nones has been largely overlooked. In response, our paper unveils an\ninformation-theoretic generalization framework for FL. Specifically, it\nquantifies generalization errors by evaluating the information entropy of local\ndistributions and discerning discrepancies across these distributions. Inspired\nby our deduced generalization bounds, we introduce a weighted aggregation\napproach and a duo of client selection strategies. These innovations aim to\nbolster FL's generalization prowess by encompassing a more varied set of client\ndata distributions. Our extensive empirical evaluations reaffirm the potency of\nour proposed methods, aligning seamlessly with our theoretical construct.\n","authors":["Zheshun Wu","Zenglin Xu","Dun Zeng","Qifan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.07171v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08792v1","updated":"2023-10-13T00:34:12Z","published":"2023-10-13T00:34:12Z","title":"Incentive Mechanism Design for Distributed Ensemble Learning","summary":" Distributed ensemble learning (DEL) involves training multiple models at\ndistributed learners, and then combining their predictions to improve\nperformance. Existing related studies focus on DEL algorithm design and\noptimization but ignore the important issue of incentives, without which\nself-interested learners may be unwilling to participate in DEL. We aim to fill\nthis gap by presenting a first study on the incentive mechanism design for DEL.\nOur proposed mechanism specifies both the amount of training data and reward\nfor learners with heterogeneous computation and communication costs. One design\nchallenge is to have an accurate understanding regarding how learners'\ndiversity (in terms of training data) affects the ensemble accuracy. To this\nend, we decompose the ensemble accuracy into a diversity-precision tradeoff to\nguide the mechanism design. Another challenge is that the mechanism design\ninvolves solving a mixed-integer program with a large search space. To this\nend, we propose an alternating algorithm that iteratively updates each\nlearner's training data size and reward. 
We prove that under mild conditions,\nthe algorithm converges. Numerical results using MNIST dataset show an\ninteresting result: our proposed mechanism may prefer a lower level of learner\ndiversity to achieve a higher ensemble accuracy.\n","authors":["Chao Huang","Pengchao Han","Jianwei Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08792v1.pdf","comment":"Accepted to IEEE GLOBECOM 2023"},{"id":"http://arxiv.org/abs/2302.04062v5","updated":"2023-10-13T00:29:41Z","published":"2023-02-08T13:59:31Z","title":"Machine Learning for Synthetic Data Generation: A Review","summary":" Machine learning heavily relies on data, but real-world applications often\nencounter various data-related issues. These include data of poor quality,\ninsufficient data points leading to under-fitting of machine learning models,\nand difficulties in data access due to concerns surrounding privacy, safety,\nand regulations. In light of these challenges, the concept of synthetic data\ngeneration emerges as a promising alternative that allows for data sharing and\nutilization in ways that real-world data cannot facilitate. This paper presents\na comprehensive systematic review of existing studies that employ machine\nlearning models for the purpose of generating synthetic data. The review\nencompasses various perspectives, starting with the applications of synthetic\ndata generation, spanning computer vision, speech, natural language processing,\nhealthcare, and business domains. Additionally, it explores different machine\nlearning methods, with particular emphasis on neural network architectures and\ndeep generative models. The paper also addresses the crucial aspects of privacy\nand fairness concerns related to synthetic data generation. Furthermore, this\nstudy identifies the challenges and opportunities prevalent in this emerging\nfield, shedding light on the potential avenues for future research. By delving\ninto the intricacies of synthetic data generation, this paper aims to\ncontribute to the advancement of knowledge and inspire further exploration in\nsynthetic data generation.\n","authors":["Yingzhou Lu","Minjie Shen","Huazheng Wang","Capucine van Rechem","Wenqi Wei"],"pdf_url":"https://arxiv.org/pdf/2302.04062v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08790v1","updated":"2023-10-13T00:25:21Z","published":"2023-10-13T00:25:21Z","title":"Price of Stability in Quality-Aware Federated Learning","summary":" Federated Learning (FL) is a distributed machine learning scheme that enables\nclients to train a shared global model without exchanging local data. The\npresence of label noise can severely degrade the FL performance, and some\nexisting studies have focused on algorithm design for label denoising. However,\nthey ignored the important issue that clients may not apply costly label\ndenoising strategies due to them being self-interested and having heterogeneous\nvaluations on the FL performance. To fill this gap, we model the clients'\ninteractions as a novel label denoising game and characterize its equilibrium.\nWe also analyze the price of stability, which quantifies the difference in the\nsystem performance (e.g., global model accuracy, social welfare) between the\nequilibrium outcome and the socially optimal solution. We prove that the\nequilibrium outcome always leads to a lower global model accuracy than the\nsocially optimal solution does. We further design an efficient algorithm to\ncompute the socially optimal solution. 
Numerical experiments on MNIST dataset\nshow that the price of stability increases as the clients' data become noisier,\ncalling for an effective incentive mechanism.\n","authors":["Yizhou Yan","Xinyu Tang","Chao Huang","Ming Tang"],"pdf_url":"https://arxiv.org/pdf/2310.08790v1.pdf","comment":"Accepted to IEEE GLOBECOM 2023"},{"id":"http://arxiv.org/abs/2310.08782v1","updated":"2023-10-13T00:07:49Z","published":"2023-10-13T00:07:49Z","title":"Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced\n Transfer Learning","summary":" Massive data is often considered essential for deep learning applications,\nbut it also incurs significant computational and infrastructural costs.\nTherefore, dataset pruning (DP) has emerged as an effective way to improve data\nefficiency by identifying and removing redundant training samples without\nsacrificing performance. In this work, we aim to address the problem of DP for\ntransfer learning, i.e., how to prune a source dataset for improved pretraining\nefficiency and lossless finetuning accuracy on downstream target tasks. To our\nbest knowledge, the problem of DP for transfer learning remains open, as\nprevious studies have primarily addressed DP and transfer learning as separate\nproblems. By contrast, we establish a unified viewpoint to integrate DP with\ntransfer learning and find that existing DP methods are not suitable for the\ntransfer learning paradigm. We then propose two new DP methods, label mapping\nand feature mapping, for supervised and self-supervised pretraining settings\nrespectively, by revisiting the DP problem through the lens of source-target\ndomain mapping. Furthermore, we demonstrate the effectiveness of our approach\non numerous transfer learning tasks. We show that source data classes can be\npruned by up to 40% ~ 80% without sacrificing downstream performance, resulting\nin a significant 2 ~ 5 times speed-up during the pretraining stage. Besides,\nour proposal exhibits broad applicability and can improve other computationally\nintensive transfer learning techniques, such as adversarial pretraining. Codes\nare available at https://github.com/OPTML-Group/DP4TL.\n","authors":["Yihua Zhang","Yimeng Zhang","Aochuan Chen","Jinghan Jia","Jiancheng Liu","Gaowen Liu","Mingyi Hong","Shiyu Chang","Sijia Liu"],"pdf_url":"https://arxiv.org/pdf/2310.08782v1.pdf","comment":"Thirty-seventh Conference on Neural Information Processing Systems\n (NeurIPS 2023)"}],"Multimedia":[{"id":"http://arxiv.org/abs/2310.09147v1","updated":"2023-10-13T14:39:34Z","published":"2023-10-13T14:39:34Z","title":"Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA","summary":" Text-based visual question answering (TextVQA) faces the significant\nchallenge of avoiding redundant relational inference. To be specific, a large\nnumber of detected objects and optical character recognition (OCR) tokens\nresult in rich visual relationships. Existing works take all visual\nrelationships into account for answer prediction. However, there are three\nobservations: (1) a single subject in the images can be easily detected as\nmultiple objects with distinct bounding boxes (considered repetitive objects).\nThe associations between these repetitive objects are superfluous for answer\nreasoning; (2) two spatially distant OCR tokens detected in the image\nfrequently have weak semantic dependencies for answer reasoning; and (3) the\nco-existence of nearby objects and tokens may be indicative of important visual\ncues for predicting answers. 
Rather than utilizing all of them for answer\nprediction, we make an effort to identify the most important connections or\neliminate redundant ones. We propose a sparse spatial graph network (SSGN) that\nintroduces a spatially aware relation pruning technique to this task. As\nspatial factors for relation measurement, we employ spatial distance, geometric\ndimension, overlap area, and DIoU for spatially aware pruning. We consider\nthree visual relationships for graph learning: object-object, OCR-OCR tokens,\nand object-OCR token relationships. SSGN is a progressive graph learning\narchitecture that verifies the pivotal relations in the correlated object-token\nsparse graph, and then in the respective object-based sparse graph and\ntoken-based sparse graph. Experiment results on TextVQA and ST-VQA datasets\ndemonstrate that SSGN achieves promising performances. And some visualization\nresults further demonstrate the interpretability of our method.\n","authors":["Sheng Zhou","Dan Guo","Jia Li","Xun Yang","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2310.09147v1.pdf","comment":"Accepted by TIP 2023"},{"id":"http://arxiv.org/abs/2306.09675v3","updated":"2023-10-13T13:48:58Z","published":"2023-06-16T08:13:41Z","title":"Multi-View Class Incremental Learning","summary":" Multi-view learning (MVL) has gained great success in integrating information\nfrom multiple perspectives of a dataset to improve downstream task performance.\nTo make MVL methods more practical in an open-ended environment, this paper\ninvestigates a novel paradigm called multi-view class incremental learning\n(MVCIL), where a single model incrementally classifies new classes from a\ncontinual stream of views, requiring no access to earlier views of data.\nHowever, MVCIL is challenged by the catastrophic forgetting of old information\nand the interference with learning new concepts. To address this, we first\ndevelop a randomization-based representation learning technique serving for\nfeature extraction to guarantee their separate view-optimal working states,\nduring which multiple views belonging to a class are presented sequentially;\nThen, we integrate them one by one in the orthogonality fusion subspace spanned\nby the extracted features; Finally, we introduce selective weight consolidation\nfor learning-without-forgetting decision-making while encountering new classes.\nExtensive experiments on synthetic and real-world datasets validate the\neffectiveness of our approach.\n","authors":["Depeng Li","Tianqi Wang","Junwei Chen","Kenji Kawaguchi","Cheng Lian","Zhigang Zeng"],"pdf_url":"https://arxiv.org/pdf/2306.09675v3.pdf","comment":"Accepted to Information Fusion"},{"id":"http://arxiv.org/abs/2310.09036v1","updated":"2023-10-13T11:57:04Z","published":"2023-10-13T11:57:04Z","title":"MM-BigBench: Evaluating Multimodal Models on Multimodal Content\n Comprehension Tasks","summary":" The popularity of multimodal large language models (MLLMs) has triggered a\nrecent surge in research efforts dedicated to evaluating these models.\nNevertheless, existing evaluation studies of MLLMs primarily focus on the\ncomprehension and reasoning of unimodal (vision) content, neglecting\nperformance evaluations in the domain of multimodal (vision-language) content\nunderstanding. Beyond multimodal reasoning, tasks related to multimodal content\ncomprehension necessitate a profound understanding of multimodal contexts,\nachieved through the multimodal interaction to obtain a final answer. 
In this\npaper, we introduce a comprehensive assessment framework called MM-BigBench,\nwhich incorporates a diverse range of metrics to offer an extensive evaluation\nof the performance of various models and instructions across a wide spectrum of\ndiverse multimodal content comprehension tasks. Consequently, our work\ncomplements research on the performance of MLLMs in multimodal comprehension\ntasks, achieving a more comprehensive and holistic evaluation of MLLMs. To\nbegin, we employ the Best Performance metric to ascertain each model's\nperformance upper bound on different datasets. Subsequently, the Mean Relative\nGain metric offers an assessment of the overall performance of various models\nand instructions, while the Stability metric measures their sensitivity.\nFurthermore, previous research centers on evaluating models independently or\nsolely assessing instructions, neglecting the adaptability between models and\ninstructions. We propose the Adaptability metric to quantify the adaptability\nbetween models and instructions. Our paper evaluates a total of 20 language\nmodels (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10\ninstructions for each task, and derives novel insights. Our code will be\nreleased at https://github.com/declare-lab/MM-BigBench.\n","authors":["Xiaocui Yang","Wenfang Wu","Shi Feng","Ming Wang","Daling Wang","Yang Li","Qi Sun","Yifei Zhang","Xiaoming Fu","Soujanya Poria"],"pdf_url":"https://arxiv.org/pdf/2310.09036v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.08981v1","updated":"2023-10-13T09:57:09Z","published":"2023-10-13T09:57:09Z","title":"Low-latency Speech Enhancement via Speech Token Generation","summary":" Existing deep learning based speech enhancement methods mainly employ a data-driven\napproach, which leverages large amounts of data with a variety of noise types to\nachieve noise removal from noisy signals. However, the high dependence on the\ndata limits their generalization to unseen complex noises in real-life\nenvironments. In this paper, we focus on the low-latency scenario and regard\nspeech enhancement as a speech generation problem conditioned on the noisy\nsignal, where we generate clean speech instead of identifying and removing\nnoises. Specifically, we propose a conditional generative framework for speech\nenhancement, which models clean speech by acoustic codes of a neural speech\ncodec and generates the speech codes conditioned on past noisy frames in an\nauto-regressive way. Moreover, we propose an explicit-alignment approach to\nalign noisy frames with the generated speech tokens to improve the robustness\nand scalability to different input lengths. Different from other methods that\nleverage multiple stages to generate speech codes, we leverage a single-stage\nspeech generation approach based on the TF-Codec neural codec to achieve high\nspeech quality with low latency. Extensive results on both synthetic and\nreal-recorded test sets show its superiority over data-driven approaches in\nterms of noise robustness and temporal speech coherence.\n","authors":["Huaying Xue","Xiulian Peng","Yan Lu"],"pdf_url":"https://arxiv.org/pdf/2310.08981v1.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2112.09726v3","updated":"2023-10-13T08:10:41Z","published":"2021-12-17T19:22:01Z","title":"Soundify: Matching Sound Effects to Video","summary":" In the art of video editing, sound helps add character to an object and\nimmerse the viewer within a space. 
Through formative interviews with\nprofessional editors (N=10), we found that the task of adding sounds to video\ncan be challenging. This paper presents Soundify, a system that assists editors\nin matching sounds to video. Given a video, Soundify identifies matching\nsounds, synchronizes the sounds to the video, and dynamically adjusts panning\nand volume to create spatial audio. In a human evaluation study (N=889), we\nshow that Soundify is capable of matching sounds to video out-of-the-box for a\ndiverse range of audio categories. In a within-subjects expert study (N=12), we\ndemonstrate the usefulness of Soundify in helping video editors match sounds to\nvideo with lighter workload, reduced task completion time, and improved\nusability.\n","authors":["David Chuan-En Lin","Anastasis Germanidis","Cristóbal Valenzuela","Yining Shi","Nikolas Martelaro"],"pdf_url":"https://arxiv.org/pdf/2112.09726v3.pdf","comment":"Full paper in UIST 2023; Short paper in NeurIPS 2021 ML4CD Workshop;\n Online demo: http://soundify.cc"},{"id":"http://arxiv.org/abs/2310.08475v2","updated":"2023-10-13T01:12:25Z","published":"2023-10-12T16:32:44Z","title":"Can We Edit Multimodal Large Language Models?","summary":" In this paper, we focus on editing Multimodal Large Language Models (MLLMs).\nCompared to editing single-modal LLMs, multimodal model editing is more\nchallenging, which demands a higher level of scrutiny and careful consideration\nin the editing process. To facilitate research in this area, we construct a new\nbenchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite\nof innovative metrics for evaluation. We conduct comprehensive experiments\ninvolving various model editing baselines and analyze the impact of editing\ndifferent components for multimodal LLMs. Empirically, we notice that previous\nbaselines can implement editing multimodal LLMs to some extent, but the effect\nis still barely satisfactory, indicating the potential difficulty of this task.\nWe hope that our work can provide the NLP community with insights. Code and\ndataset are available in https://github.com/zjunlp/EasyEdit.\n","authors":["Siyuan Cheng","Bozhong Tian","Qingbin Liu","Xi Chen","Yongheng Wang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08475v2.pdf","comment":"EMNLP 2023"}]},"2023-10-17T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.11454v1","updated":"2023-10-17T17:59:46Z","published":"2023-10-17T17:59:46Z","title":"VeRA: Vector-based Random Matrix Adaptation","summary":" Low-rank adapation (LoRA) is a popular method that reduces the number of\ntrainable parameters when finetuning large language models, but still faces\nacute storage challenges when scaling to even larger models or deploying\nnumerous per-user or per-task adapted models. In this work, we present\nVector-based Random Matrix Adaptation (VeRA), which reduces the number of\ntrainable parameters by 10x compared to LoRA, yet maintains the same\nperformance. It achieves this by using a single pair of low-rank matrices\nshared across all layers and learning small scaling vectors instead. 
We\ndemonstrate its effectiveness on the GLUE and E2E benchmarks, and show its\napplication in instruction-following with just 1.4M parameters using the Llama2\n7B model.\n","authors":["Dawid Jan Kopiczko","Tijmen Blankevoort","Yuki Markus Asano"],"pdf_url":"https://arxiv.org/pdf/2310.11454v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11453v1","updated":"2023-10-17T17:59:15Z","published":"2023-10-17T17:59:15Z","title":"BitNet: Scaling 1-bit Transformers for Large Language Models","summary":" The increasing size of large language models has posed challenges for\ndeployment and raised concerns about environmental impact due to high energy\nconsumption. In this work, we introduce BitNet, a scalable and stable 1-bit\nTransformer architecture designed for large language models. Specifically, we\nintroduce BitLinear as a drop-in replacement of the nn.Linear layer in order to\ntrain 1-bit weights from scratch. Experimental results on language modeling\nshow that BitNet achieves competitive performance while substantially reducing\nmemory footprint and energy consumption, compared to state-of-the-art 8-bit\nquantization methods and FP16 Transformer baselines. Furthermore, BitNet\nexhibits a scaling law akin to full-precision Transformers, suggesting its\npotential for effective scaling to even larger language models while\nmaintaining efficiency and performance benefits.\n","authors":["Hongyu Wang","Shuming Ma","Li Dong","Shaohan Huang","Huaijie Wang","Lingxiao Ma","Fan Yang","Ruiping Wang","Yi Wu","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2310.11453v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2310.11451v1","updated":"2023-10-17T17:58:34Z","published":"2023-10-17T17:58:34Z","title":"Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from\n a Parametric Perspective","summary":" Large Language Models (LLMs) inherently encode a wealth of knowledge within\ntheir parameters through pre-training on extensive corpora. While prior\nresearch has delved into operations on these parameters to manipulate the\nunderlying implicit knowledge (encompassing detection, editing, and merging),\nthere remains an ambiguous understanding regarding their transferability across\nmodels with varying scales. In this paper, we seek to empirically investigate\nknowledge transfer from larger to smaller models through a parametric\nperspective. To achieve this, we employ sensitivity-based techniques to extract\nand align knowledge-specific parameters between different LLMs. Moreover, the\nLoRA module is used as the intermediary mechanism for injecting the extracted\nknowledge into smaller models. Evaluations across four benchmarks validate the\nefficacy of our proposed method. Our findings highlight the critical factors\ncontributing to the process of parametric knowledge transfer, underscoring the\ntransferability of model parameters across LLMs of different scales. We release\ncode and data at \\url{https://github.com/maszhongming/ParaKnowTransfer}.\n","authors":["Ming Zhong","Chenxin An","Weizhu Chen","Jiawei Han","Pengcheng He"],"pdf_url":"https://arxiv.org/pdf/2310.11451v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2310.11446v1","updated":"2023-10-17T17:56:18Z","published":"2023-10-17T17:56:18Z","title":"Functional Invariants to Watermark Large Transformers","summary":" The rapid growth of transformer-based models increases the concerns about\ntheir integrity and ownership insurance. 
Watermarking addresses this issue by\nembedding a unique identifier into the model, while preserving its performance.\nHowever, most existing approaches require to optimize the weights to imprint\nthe watermark signal, which is not suitable at scale due to the computational\ncost. This paper explores watermarks with virtually no computational cost,\napplicable to a non-blind white-box setting (assuming access to both the\noriginal and watermarked networks). They generate functionally equivalent\ncopies by leveraging the models' invariance, via operations like dimension\npermutations or scaling/unscaling. This enables to watermark models without any\nchange in their outputs and remains stealthy. Experiments demonstrate the\neffectiveness of the approach and its robustness against various model\ntransformations (fine-tuning, quantization, pruning), making it a practical\nsolution to protect the integrity of large models.\n","authors":["Fernandez Pierre","Couairon Guillaume","Furon Teddy","Douze Matthijs"],"pdf_url":"https://arxiv.org/pdf/2310.11446v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11441v1","updated":"2023-10-17T17:51:31Z","published":"2023-10-17T17:51:31Z","title":"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V","summary":" We present Set-of-Mark (SoM), a new visual prompting method, to unleash the\nvisual grounding abilities of large multimodal models (LMMs), such as GPT-4V.\nAs illustrated in Fig. 1 (right), we employ off-the-shelf interactive\nsegmentation models, such as SAM, to partition an image into regions at\ndifferent levels of granularity, and overlay these regions with a set of marks\ne.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can\nanswer the questions that require visual grounding. We perform a comprehensive\nempirical study to validate the effectiveness of SoM on a wide range of\nfine-grained vision and multimodal tasks. For example, our experiments show\nthat GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring\nsegmentation model on RefCOCOg in a zero-shot setting.\n","authors":["Jianwei Yang","Hao Zhang","Feng Li","Xueyan Zou","Chunyuan Li","Jianfeng Gao"],"pdf_url":"https://arxiv.org/pdf/2310.11441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09896v4","updated":"2023-10-17T17:51:27Z","published":"2023-06-16T15:13:17Z","title":"Is Self-Repair a Silver Bullet for Code Generation?","summary":" Large language models have shown remarkable aptitude in code generation, but\nstill struggle on challenging tasks. Self-repair -- in which the model debugs\nand fixes mistakes in its own code -- has recently become a popular way to\nboost performance in these settings. However, only very limited studies on how\nand when self-repair works effectively exist in the literature, and one might\nwonder to what extent a model is really capable of repairing mistakes in code\nwhich was originally generated by that very same model. In this paper, we\nanalyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on\nproblems taken from HumanEval or APPS, finding that when the cost of carrying\nout repair is taken into account, gains are often modest, vary significantly\nbetween subsets of the data, and are sometimes not present at all. 
We\nhypothesize that this is because self-repair is bottlenecked by the model's\nability to provide feedback on its own code; boosting the feedback with\nstronger models, we observe performance gains even in settings where the model\ndoes not benefit from self-repair. Finally, we find that providing the model\nwith feedback from human participants greatly benefits repair even for GPT-4,\nand carry out a brief qualitative analysis of the differences observed.\n","authors":["Theo X. Olausson","Jeevana Priya Inala","Chenglong Wang","Jianfeng Gao","Armando Solar-Lezama"],"pdf_url":"https://arxiv.org/pdf/2306.09896v4.pdf","comment":"Added experiments for HumanEval (dataset) and Code Llama (model)"},{"id":"http://arxiv.org/abs/2305.14260v2","updated":"2023-10-17T17:46:41Z","published":"2023-05-23T17:12:09Z","title":"R2H: Building Multimodal Navigation Helpers that Respond to Help\n Requests","summary":" Intelligent navigation-helper agents are critical as they can navigate users\nin unknown areas through environmental awareness and conversational ability,\nserving as potential accessibility tools for individuals with disabilities. In\nthis work, we first introduce a novel benchmark, Respond to Help Requests\n(R2H), to promote the development of multi-modal navigation helpers capable of\nresponding to requests for help, utilizing existing dialog-based embodied\ndatasets. R2H mainly includes two tasks: (1) Respond to Dialog History (RDH),\nwhich assesses the helper agent's ability to generate informative responses\nbased on a given dialog history, and (2) Respond during Interaction (RdI),\nwhich evaluates the effectiveness and efficiency of the response during\nconsistent cooperation with a task performer. Furthermore, we explore two\napproaches to construct the navigation-helper agent, including fine-tuning a\nnovel task-oriented multi-modal response generation model that can see and\nrespond, named SeeRee, and employing a multi-modal large language model in a\nzero-shot manner. Analysis of the task and method was conducted based on both\nautomatic benchmarking and human evaluations. Project website:\nhttps://sites.google.com/view/response2helprequests/home.\n","authors":["Yue Fan","Jing Gu","Kaizhi Zheng","Xin Eric Wang"],"pdf_url":"https://arxiv.org/pdf/2305.14260v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11430v1","updated":"2023-10-17T17:40:21Z","published":"2023-10-17T17:40:21Z","title":"An Empirical Study of Translation Hypothesis Ensembling with Large\n Language Models","summary":" Large language models (LLMs) are becoming a one-fits-many solution, but they\nsometimes hallucinate or produce unreliable output. In this paper, we\ninvestigate how hypothesis ensembling can improve the quality of the generated\ntext for the specific problem of LLM-based machine translation. We experiment\nwith several techniques for ensembling hypotheses produced by LLMs such as\nChatGPT, LLaMA, and Alpaca. We provide a comprehensive study along multiple\ndimensions, including the method to generate hypotheses (multiple prompts,\ntemperature-based sampling, and beam search) and the strategy to produce the\nfinal translation (instruction-based, quality-based reranking, and minimum\nBayes risk (MBR) decoding). 
Our results show that MBR decoding is a very\neffective method, that translation quality can be improved using a small number\nof samples, and that instruction tuning has a strong impact on the relation\nbetween the diversity of the hypotheses and the sampling temperature.\n","authors":["António Farinhas","José G. C. de Souza","André F. T. Martins"],"pdf_url":"https://arxiv.org/pdf/2310.11430v1.pdf","comment":"EMNLP 2023 (main conference)"},{"id":"http://arxiv.org/abs/2309.16039v2","updated":"2023-10-17T17:32:17Z","published":"2023-09-27T21:41:49Z","title":"Effective Long-Context Scaling of Foundation Models","summary":" We present a series of long-context LLMs that support effective context\nwindows of up to 32,768 tokens. Our model series are built through continual\npretraining from Llama 2 with longer training sequences and on a dataset where\nlong texts are upsampled. We perform extensive evaluation on language modeling,\nsynthetic context probing tasks, and a wide range of research benchmarks. On\nresearch benchmarks, our models achieve consistent improvements on most regular\ntasks and significant improvements on long-context tasks over Llama 2. Notably,\nwith a cost-effective instruction tuning procedure that does not require\nhuman-annotated long instruction data, the 70B variant can already surpass\ngpt-3.5-turbo-16k's overall performance on a suite of long-context tasks.\nAlongside these results, we provide an in-depth analysis on the individual\ncomponents of our method. We delve into Llama's position encodings and discuss\nits limitation in modeling long dependencies. We also examine the impact of\nvarious design choices in the pretraining process, including the data mix and\nthe training curriculum of sequence lengths -- our ablation experiments suggest\nthat having abundant long texts in the pretrain dataset is not the key to\nachieving strong performance, and we empirically verify that long context\ncontinual pretraining is more efficient and similarly effective compared to\npretraining from scratch with long sequences.\n","authors":["Wenhan Xiong","Jingyu Liu","Igor Molybog","Hejia Zhang","Prajjwal Bhargava","Rui Hou","Louis Martin","Rashi Rungta","Karthik Abinav Sankararaman","Barlas Oguz","Madian Khabsa","Han Fang","Yashar Mehdad","Sharan Narang","Kshitiz Malik","Angela Fan","Shruti Bhosale","Sergey Edunov","Mike Lewis","Sinong Wang","Hao Ma"],"pdf_url":"https://arxiv.org/pdf/2309.16039v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10066v2","updated":"2023-10-17T17:24:36Z","published":"2023-09-18T18:33:40Z","title":"Automatic Personalized Impression Generation for PET Reports Using Large\n Language Models","summary":" In this study, we aimed to determine if fine-tuned large language models\n(LLMs) can generate accurate, personalized impressions for whole-body PET\nreports. Twelve language models were trained on a corpus of PET reports using\nthe teacher-forcing algorithm, with the report findings as input and the\nclinical impressions as reference. An extra input token encodes the reading\nphysician's identity, allowing models to learn physician-specific reporting\nstyles. Our corpus comprised 37,370 retrospective PET reports collected from\nour institution between 2010 and 2022. To identify the best LLM, 30 evaluation\nmetrics were benchmarked against quality scores from two nuclear medicine (NM)\nphysicians, with the most aligned metrics selecting the model for expert\nevaluation. 
In a subset of data, model-generated impressions and original\nclinical impressions were assessed by three NM physicians according to 6\nquality dimensions (3-point scale) and an overall utility score (5-point\nscale). Each physician reviewed 12 of their own reports and 12 reports from\nother physicians. Bootstrap resampling was used for statistical analysis. Of\nall evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the\nhighest Spearman's rank correlations (0.568 and 0.563) with physician\npreferences. Based on these metrics, the fine-tuned PEGASUS model was selected\nas the top LLM. When physicians reviewed PEGASUS-generated impressions in their\nown style, 89% were considered clinically acceptable, with a mean utility score\nof 4.08 out of 5. Physicians rated these personalized impressions as comparable\nin overall utility to the impressions dictated by other physicians (4.03,\nP=0.41). In conclusion, personalized impressions generated by PEGASUS were\nclinically useful, highlighting its potential to expedite PET reporting.\n","authors":["Xin Tie","Muheon Shin","Ali Pirasteh","Nevein Ibrahim","Zachary Huemann","Sharon M. Castellino","Kara M. Kelly","John Garrett","Junjie Hu","Steve Y. Cho","Tyler J. Bradshaw"],"pdf_url":"https://arxiv.org/pdf/2309.10066v2.pdf","comment":"25 pages in total. 6 figures and 3 tables in the main body. The\n manuscript has been submitted to a journal for potential publication"},{"id":"http://arxiv.org/abs/2309.15714v2","updated":"2023-10-17T17:17:59Z","published":"2023-09-27T15:12:08Z","title":"Integrating LLM, EEG, and Eye-Tracking Biomarker Analysis for Word-Level\n Neural State Classification in Semantic Inference Reading Comprehension","summary":" With the recent proliferation of large language models (LLMs), such as\nGenerative Pre-trained Transformers (GPT), there has been a significant shift\nin exploring human and machine comprehension of semantic language meaning. This\nshift calls for interdisciplinary research that bridges cognitive science and\nnatural language processing (NLP). This pilot study aims to provide insights\ninto individuals' neural states during a semantic relation\nreading-comprehension task. We propose jointly analyzing LLMs, eye-gaze, and\nelectroencephalographic (EEG) data to study how the brain processes words with\nvarying degrees of relevance to a keyword during reading. We also use a feature\nengineering approach to improve the fixation-related EEG data classification\nwhile participants read words with high versus low relevance to the keyword.\nThe best validation accuracy in this word-level classification is over 60\\%\nacross 12 subjects. Words of high relevance to the inference keyword had\nsignificantly more eye fixations per word: 1.0584 compared to 0.6576 when\nexcluding no-fixation words, and 1.5126 compared to 1.4026 when including them.\nThis study represents the first attempt to classify brain states at a word\nlevel using LLM knowledge. 
It provides valuable insights into human cognitive\nabilities and the realm of Artificial General Intelligence (AGI), and offers\nguidance for developing potential reading-assisted technologies.\n","authors":["Yuhong Zhang","Qin Li","Sujal Nahata","Tasnia Jamal","Shih-kuen Cheng","Gert Cauwenberghs","Tzyy-Ping Jung"],"pdf_url":"https://arxiv.org/pdf/2309.15714v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13812v2","updated":"2023-10-17T17:07:29Z","published":"2023-05-23T08:28:38Z","title":"Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for\n Improved Vision-Language Compositionality","summary":" Contrastively trained vision-language models have achieved remarkable\nprogress in vision and language representation learning, leading to\nstate-of-the-art models for various downstream multimodal tasks. However,\nrecent research has highlighted severe limitations of these models in their\nability to perform compositional reasoning over objects, attributes, and\nrelations. Scene graphs have emerged as an effective way to understand images\ncompositionally. These are graph-structured semantic representations of images\nthat contain objects, their attributes, and relations with other objects in a\nscene. In this work, we consider the scene graph parsed from text as a proxy\nfor the image scene graph and propose a graph decomposition and augmentation\nframework along with a coarse-to-fine contrastive learning objective between\nimages and text that aligns sentences of various complexities to the same\nimage. Along with this, we propose novel negative mining techniques in the\nscene graph space for improving attribute binding and relation understanding.\nThrough extensive experiments, we demonstrate the effectiveness of our approach\nthat significantly improves attribute binding, relation understanding,\nsystematic generalization, and productivity on multiple recently proposed\nbenchmarks (For example, improvements upto $18\\%$ for systematic\ngeneralization, $16.5\\%$ for relation understanding over a strong baseline),\nwhile achieving similar or better performance than CLIP on various general\nmultimodal tasks.\n","authors":["Harman Singh","Pengchuan Zhang","Qifan Wang","Mengjiao Wang","Wenhan Xiong","Jingfei Du","Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2305.13812v2.pdf","comment":"EMNLP 2023 (main)"},{"id":"http://arxiv.org/abs/2310.11398v1","updated":"2023-10-17T17:06:26Z","published":"2023-10-17T17:06:26Z","title":"Neural Attention: Enhancing QKV Calculation in Self-Attention Mechanism\n with Neural Networks","summary":" In the realm of deep learning, the self-attention mechanism has substantiated\nits pivotal role across a myriad of tasks, encompassing natural language\nprocessing and computer vision. Despite achieving success across diverse\napplications, the traditional self-attention mechanism primarily leverages\nlinear transformations for the computation of query, key, and value (QKV),\nwhich may not invariably be the optimal choice under specific circumstances.\nThis paper probes into a novel methodology for QKV computation-implementing a\nspecially-designed neural network structure for the calculation. Utilizing a\nmodified Marian model, we conducted experiments on the IWSLT 2017\nGerman-English translation task dataset and juxtaposed our method with the\nconventional approach. The experimental results unveil a significant\nenhancement in BLEU scores with our method. 
Furthermore, our approach also\nmanifested superiority when training the Roberta model with the Wikitext-103\ndataset, reflecting a notable reduction in model perplexity compared to its\noriginal counterpart. These experimental outcomes not only validate the\nefficacy of our method but also reveal the immense potential in optimizing the\nself-attention mechanism through neural network-based QKV computation, paving\nthe way for future research and practical applications. The source code and\nimplementation details for our proposed method can be accessed at\nhttps://github.com/ocislyjrti/NeuralAttention.\n","authors":["Muhan Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.11398v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14324v2","updated":"2023-10-17T16:33:33Z","published":"2023-05-23T17:54:57Z","title":"Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and\n Tie Calibration","summary":" Kendall's tau is frequently used to meta-evaluate how well machine\ntranslation (MT) evaluation metrics score individual translations. Its focus on\npairwise score comparisons is intuitive but raises the question of how ties\nshould be handled, a gray area that has motivated different variants in the\nliterature. We demonstrate that, in settings like modern MT meta-evaluation,\nexisting variants have weaknesses arising from their handling of ties, and in\nsome situations can even be gamed. We propose instead to meta-evaluate metrics\nwith a version of pairwise accuracy that gives metrics credit for correctly\npredicting ties, in combination with a tie calibration procedure that\nautomatically introduces ties into metric scores, enabling fair comparison\nbetween metrics that do and do not predict ties. We argue and provide\nexperimental evidence that these modifications lead to fairer ranking-based\nassessments of metric performance.\n","authors":["Daniel Deutsch","George Foster","Markus Freitag"],"pdf_url":"https://arxiv.org/pdf/2305.14324v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11379v1","updated":"2023-10-17T16:22:18Z","published":"2023-10-17T16:22:18Z","title":"Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles","summary":" Voice-based interfaces rely on a wake-up word mechanism to initiate\ncommunication with devices. However, achieving a robust, energy-efficient, and\nfast detection remains a challenge. This paper addresses these real production\nneeds by enhancing data with temporal alignments and using detection based on\ntwo phases with multi-resolution. It employs two models: a lightweight\non-device model for real-time processing of the audio stream and a verification\nmodel on the server-side, which is an ensemble of heterogeneous architectures\nthat refine detection. This scheme allows the optimization of two operating\npoints. To protect privacy, audio features are sent to the cloud instead of raw\naudio. The study investigated different parametric configurations for feature\nextraction to select one for on-device detection and another for the\nverification model. Furthermore, thirteen different audio classifiers were\ncompared in terms of performance and inference time. 
The proposed ensemble\noutperforms our stronger classifier in every noise condition.\n","authors":["Fernando López","Jordi Luque","Carlos Segura","Pablo Gómez"],"pdf_url":"https://arxiv.org/pdf/2310.11379v1.pdf","comment":"5 pages, 3 figures"},{"id":"http://arxiv.org/abs/2308.01263v2","updated":"2023-10-17T16:21:55Z","published":"2023-08-02T16:30:40Z","title":"XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in\n Large Language Models","summary":" Without proper safeguards, large language models will readily follow\nmalicious instructions and generate toxic content. This risk motivates safety\nefforts such as red-teaming and large-scale feedback learning, which aim to\nmake models both helpful and harmless. However, there is a tension between\nthese two objectives, since harmlessness requires models to refuse to comply\nwith unsafe prompts, and thus not be helpful. Recent anecdotal evidence\nsuggests that some models may have struck a poor balance, so that even clearly\nsafe prompts are refused if they use similar language to unsafe prompts or\nmention sensitive topics. In this paper, we introduce a new test suite called\nXSTest to identify such eXaggerated Safety behaviours in a systematic way.\nXSTest comprises 250 safe prompts across ten prompt types that well-calibrated\nmodels should not refuse to comply with, and 200 unsafe prompts as contrasts\nthat models, for most applications, should refuse. We describe XSTest's\ncreation and composition, and then use the test suite to highlight systematic\nfailure modes in state-of-the-art language models as well as more general\nchallenges in building safer language models.\n","authors":["Paul Röttger","Hannah Rose Kirk","Bertie Vidgen","Giuseppe Attanasio","Federico Bianchi","Dirk Hovy"],"pdf_url":"https://arxiv.org/pdf/2308.01263v2.pdf","comment":"v2 prepared for conference submission"},{"id":"http://arxiv.org/abs/2310.11374v1","updated":"2023-10-17T16:15:34Z","published":"2023-10-17T16:15:34Z","title":"DialogueLLM: Context and Emotion Knowledge-Tuned LLaMA Models for\n Emotion Recognition in Conversations","summary":" Large language models (LLMs) and their variants have shown extraordinary\nefficacy across numerous downstream natural language processing (NLP) tasks,\nwhich has presented a new vision for the development of NLP. Despite their\nremarkable performance in natural language generating (NLG), LLMs lack a\ndistinct focus on the emotion understanding domain. As a result, using LLMs for\nemotion recognition may lead to suboptimal and inadequate precision. Another\nlimitation of LLMs is that they are typical trained without leveraging\nmulti-modal information. To overcome these limitations, we propose DialogueLLM,\na context and emotion knowledge tuned LLM that is obtained by fine-tuning LLaMA\nmodels with 13,638 multi-modal (i.e., texts and videos) emotional dialogues.\nThe visual information is considered as the supplementary knowledge to\nconstruct high-quality instructions. We offer a comprehensive evaluation of our\nproposed model on three benchmarking emotion recognition in conversations (ERC)\ndatasets and compare the results against the SOTA baselines and other SOTA\nLLMs. 
Additionally, DialogueLLM-7B can be easily trained using LoRA on a 40GB\nA100 GPU in 5 hours, facilitating reproducibility for other researchers.\n","authors":["Yazhou Zhang","Mengyao Wang","Prayag Tiwari","Qiuchi Li","Benyou Wang","Jing Qin"],"pdf_url":"https://arxiv.org/pdf/2310.11374v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11368v1","updated":"2023-10-17T16:05:52Z","published":"2023-10-17T16:05:52Z","title":"VECHR: A Dataset for Explainable and Robust Classification of\n Vulnerability Type in the European Court of Human Rights","summary":" Recognizing vulnerability is crucial for understanding and implementing\ntargeted support to empower individuals in need. This is especially important\nat the European Court of Human Rights (ECtHR), where the court adapts\nConvention standards to meet actual individual needs and thus ensures effective\nhuman rights protection. However, the concept of vulnerability remains elusive\nat the ECtHR and no prior NLP research has dealt with it. To enable future\nresearch in this area, we present VECHR, a novel expert-annotated multi-label\ndataset comprising of vulnerability type classification and explanation\nrationale. We benchmark the performance of state-of-the-art models on VECHR\nfrom both prediction and explainability perspectives. Our results demonstrate\nthe challenging nature of the task with lower prediction performance and\nlimited agreement between models and experts. Further, we analyze the\nrobustness of these models in dealing with out-of-domain (OOD) data and observe\noverall limited performance. Our dataset poses unique challenges offering\nsignificant room for improvement regarding performance, explainability, and\nrobustness.\n","authors":["Shanshan Xu","Leon Staufer","Santosh T. Y. S. S","Oana Ichim","Corina Heri","Matthias Grabmair"],"pdf_url":"https://arxiv.org/pdf/2310.11368v1.pdf","comment":"Accepted to EMNLP 2022"},{"id":"http://arxiv.org/abs/2302.13959v2","updated":"2023-10-17T16:03:05Z","published":"2023-02-27T17:00:06Z","title":"Make Every Example Count: On the Stability and Utility of Self-Influence\n for Learning from Noisy NLP Datasets","summary":" Increasingly larger datasets have become a standard ingredient to advancing\nthe state-of-the-art in NLP. However, data quality might have already become\nthe bottleneck to unlock further gains. Given the diversity and the sizes of\nmodern datasets, standard data filtering is not straight-forward to apply,\nbecause of the multifacetedness of the harmful data and elusiveness of\nfiltering rules that would generalize across multiple tasks. We study the\nfitness of task-agnostic self-influence scores of training examples for data\ncleaning, analyze their efficacy in capturing naturally occurring outliers, and\ninvestigate to what extent self-influence based data cleaning can improve\ndownstream performance in machine translation, question answering and text\nclassification, building up on recent approaches to self-influence calculation\nand automated curriculum learning.\n","authors":["Irina Bejan","Artem Sokolov","Katja Filippova"],"pdf_url":"https://arxiv.org/pdf/2302.13959v2.pdf","comment":"Published at EMNLP 2023"},{"id":"http://arxiv.org/abs/2204.04793v2","updated":"2023-10-17T16:02:42Z","published":"2022-04-10T23:16:00Z","title":"Fake news detection using parallel BERT deep neural networks","summary":" Fake news is a growing challenge for social networks and media. 
Detecting\nfake news has been a problem for many years, but with the evolution of\nsocial networks and the increasing speed of news dissemination in recent years, it has\nreceived renewed attention. There are several approaches to solving this problem,\none of which is to detect fake news based on its text style using deep neural\nnetworks. In recent years, one of the most widely used forms of deep neural networks\nfor natural language processing is transfer learning with transformers. BERT is\none of the most promising transformers and outperforms other models in many NLP\nbenchmarks. In this article, we introduce MWPBert, which uses two parallel BERT\nnetworks to perform veracity detection on full-text news articles. One of the\nBERT networks encodes the news headline, and the other encodes the news body. Since the\ninput length of the BERT network is limited and constant and the news body is\nusually a long text, we cannot feed the whole news text into BERT.\nTherefore, using the MaxWorth algorithm, we select the part of the news text\nthat is most valuable for fact-checking and feed it into the BERT network.\nFinally, we pass the outputs of the two BERT networks to an output network to\nclassify the news. The experimental results showed that the proposed model\noutperformed previous models in terms of accuracy and other performance\nmeasures.\n","authors":["Mahmood Farokhian","Vahid Rafe","Hadi Veisi"],"pdf_url":"https://arxiv.org/pdf/2204.04793v2.pdf","comment":"Multimed Tools Appl (2023)"},{"id":"http://arxiv.org/abs/2310.11363v1","updated":"2023-10-17T16:00:26Z","published":"2023-10-17T16:00:26Z","title":"Disentangling the Linguistic Competence of Privacy-Preserving BERT","summary":" Differential Privacy (DP) has been tailored to address the unique challenges\nof text-to-text privatization. However, text-to-text privatization is known for\ndegrading the performance of language models when trained on perturbed text.\nEmploying a series of interpretation techniques on the internal representations\nextracted from BERT trained on perturbed pre-text, we intend to disentangle at\nthe linguistic level the distortion induced by differential privacy.\nExperimental results from a representational similarity analysis indicate that\nthe overall similarity of internal representations is substantially reduced.\nUsing probing tasks to unpack this dissimilarity, we find evidence that\ntext-to-text privatization affects the linguistic competence across several\nformalisms, encoding localized properties of words while falling short at\nencoding the contextual relationships between spans of words.\n","authors":["Stefan Arnold","Nils Kemmerzell","Annika Schreiner"],"pdf_url":"https://arxiv.org/pdf/2310.11363v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11360v1","updated":"2023-10-17T15:55:31Z","published":"2023-10-17T15:55:31Z","title":"Enhancing Neural Machine Translation with Semantic Units","summary":" Conventional neural machine translation (NMT) models typically use subwords\nand words as the basic units for model input and comprehension. However,\ncomplete words and phrases composed of several tokens are often the fundamental\nunits for expressing semantics, referred to as semantic units. 
To address this\nissue, we propose a method Semantic Units for Machine Translation (SU4MT) which\nmodels the integral meanings of semantic units within a sentence, and then\nleverages them to provide a new perspective for understanding the sentence.\nSpecifically, we first propose Word Pair Encoding (WPE), a phrase extraction\nmethod to help identify the boundaries of semantic units. Next, we design an\nAttentive Semantic Fusion (ASF) layer to integrate the semantics of multiple\nsubwords into a single vector: the semantic unit representation. Lastly, the\nsemantic-unit-level sentence representation is concatenated to the token-level\none, and they are combined as the input of encoder. Experimental results\ndemonstrate that our method effectively models and leverages\nsemantic-unit-level information and outperforms the strong baselines. The code\nis available at https://github.com/ictnlp/SU4MT.\n","authors":["Langlin Huang","Shuhao Gu","Zhuocheng Zhang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.11360v1.pdf","comment":"Accepted to EMNLP findings 2023"},{"id":"http://arxiv.org/abs/2305.12785v2","updated":"2023-10-17T15:48:15Z","published":"2023-05-22T07:30:35Z","title":"MacLaSa: Multi-Aspect Controllable Text Generation via Efficient\n Sampling from Compact Latent Space","summary":" Multi-aspect controllable text generation aims to generate fluent sentences\nthat possess multiple desired attributes simultaneously. Traditional methods\neither combine many operators in the decoding stage, often with costly\niteration or search in the discrete text space, or train separate controllers\nfor each aspect, resulting in a degeneration of text quality due to the\ndiscrepancy between different aspects. To address these limitations, we\nintroduce a novel approach for multi-aspect control, namely MacLaSa, that\nestimates compact latent space for multiple aspects and performs efficient\nsampling with a robust sampler based on ordinary differential equations (ODEs).\nTo eliminate the domain gaps between different aspects, we utilize a\nVariational Autoencoder (VAE) network to map text sequences from varying data\nsources into close latent representations. The estimated latent space enables\nthe formulation of joint energy-based models (EBMs) and the plugging in of\narbitrary attribute discriminators to achieve multi-aspect control. Afterwards,\nwe draw latent vector samples with an ODE-based sampler and feed sampled\nexamples to the VAE decoder to produce target text sequences. Experimental\nresults demonstrate that MacLaSa outperforms several strong baselines on\nattribute relevance and textual quality while maintaining a high inference\nspeed.\n","authors":["Hanxing Ding","Liang Pang","Zihao Wei","Huawei Shen","Xueqi Cheng","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2305.12785v2.pdf","comment":"Accepted to the Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2304.12206v2","updated":"2023-10-17T15:46:54Z","published":"2023-04-24T15:46:26Z","title":"PAXQA: Generating Cross-lingual Question Answering Examples at Training\n Scale","summary":" Existing question answering (QA) systems owe much of their success to large,\nhigh-quality training data. Such annotation efforts are costly, and the\ndifficulty compounds in the cross-lingual setting. Therefore, prior\ncross-lingual QA work has focused on releasing evaluation datasets, and then\napplying zero-shot methods as baselines. 
This work proposes a synthetic data\ngeneration method for cross-lingual QA which leverages indirect supervision\nfrom existing parallel corpora. Our method, termed PAXQA (Projecting annotations\nfor cross-lingual (x) QA), decomposes cross-lingual QA into two stages. First,\nwe apply a question generation (QG) model to the English side. Second, we apply\nannotation projection to translate both the questions and answers. To better\ntranslate questions, we propose a novel use of lexically-constrained machine\ntranslation, in which constrained entities are extracted from the parallel\nbitexts.\n We apply PAXQA to generate cross-lingual QA examples in 4 languages (662K\nexamples total), and perform human evaluation on a subset to create validation\nand test splits. We then show that models fine-tuned on these datasets\noutperform prior synthetic data generation models over several extractive QA\ndatasets. The largest performance gains are for directions with non-English\nquestions and English contexts. Ablation studies show that our dataset\ngeneration method is relatively robust to noise from automatic word alignments,\nshowing the sufficient quality of our generations. To facilitate follow-up\nwork, we release our code and datasets at https://github.com/manestay/paxqa .\n","authors":["Bryan Li","Chris Callison-Burch"],"pdf_url":"https://arxiv.org/pdf/2304.12206v2.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.11344v1","updated":"2023-10-17T15:26:40Z","published":"2023-10-17T15:26:40Z","title":"The effect of stemming and lemmatization on Portuguese fake news text\n classification","summary":" With the popularization of the internet, smartphones and social media,\ninformation spreads quickly and easily, which implies a larger flow of\ninformation in the world but also creates a problem that is harming society:\nthe dissemination of fake news. With this larger flow of information, some\npeople try to disseminate deceptive information and fake news. Automatic fake\nnews detection is a challenging task because obtaining good results requires\ndealing with linguistic problems, especially for languages that have not yet\nbeen comprehensively studied; at the same time, some techniques can help to\nreach good results when dealing with text data. The motivation for detecting\nthis deceptive information is that people need to know which information is\ntrue and trustworthy and which is not. In this work, we present the effect that\npre-processing methods such as lemmatization and stemming have on fake news\nclassification; to that end, we designed several classifier models applying\ndifferent pre-processing techniques. The results show that the pre-processing\nstep is important for obtaining better results, and that stemming and\nlemmatization are interesting methods that need to be studied further in order\nto develop techniques focused on the Portuguese language, so we can reach\nbetter results.\n","authors":["Lucca de Freitas Santos","Murilo Varges da Silva"],"pdf_url":"https://arxiv.org/pdf/2310.11344v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10236v3","updated":"2023-10-17T15:20:19Z","published":"2023-07-16T08:28:04Z","title":"Look Before You Leap: An Exploratory Study of Uncertainty Measurement\n for Large Language Models","summary":" The recent performance leap of Large Language Models (LLMs) opens up new\nopportunities across numerous industrial applications and domains. 
However,\nerroneous generations, such as false predictions, misinformation, and\nhallucination made by LLMs, have also raised severe concerns for the\ntrustworthiness of LLMs', especially in safety-, security- and\nreliability-sensitive scenarios, potentially hindering real-world adoptions.\nWhile uncertainty estimation has shown its potential for interpreting the\nprediction risks made by general machine learning (ML) models, little is known\nabout whether and to what extent it can help explore an LLM's capabilities and\ncounteract its undesired behavior. To bridge the gap, in this paper, we\ninitiate an exploratory study on the risk assessment of LLMs from the lens of\nuncertainty. In particular, we experiment with twelve uncertainty estimation\nmethods and four LLMs on four prominent natural language processing (NLP) tasks\nto investigate to what extent uncertainty estimation techniques could help\ncharacterize the prediction risks of LLMs. Our findings validate the\neffectiveness of uncertainty estimation for revealing LLMs'\nuncertain/non-factual predictions. In addition to general NLP tasks, we\nextensively conduct experiments with four LLMs for code generation on two\ndatasets. We find that uncertainty estimation can potentially uncover buggy\nprograms generated by LLMs. Insights from our study shed light on future design\nand development for reliable LLMs, facilitating further research toward\nenhancing the trustworthiness of LLMs.\n","authors":["Yuheng Huang","Jiayang Song","Zhijie Wang","Shengming Zhao","Huaming Chen","Felix Juefei-Xu","Lei Ma"],"pdf_url":"https://arxiv.org/pdf/2307.10236v3.pdf","comment":"20 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.11324v1","updated":"2023-10-17T15:03:30Z","published":"2023-10-17T15:03:30Z","title":"Quantifying Language Models' Sensitivity to Spurious Features in Prompt\n Design or: How I learned to start worrying about prompt formatting","summary":" As large language models (LLMs) are adopted as a fundamental component of\nlanguage technologies, it is crucial to accurately characterize their\nperformance. Because choices in prompt design can strongly influence model\nbehavior, this design process is critical in effectively using any modern\npre-trained generative language model. In this work, we focus on LLM\nsensitivity to a quintessential class of meaning-preserving design choices:\nprompt formatting. We find that several widely used open-source LLMs are\nextremely sensitive to subtle changes in prompt formatting in few-shot\nsettings, with performance differences of up to 76 accuracy points when\nevaluated using LLaMA-2-13B. Sensitivity remains even when increasing model\nsize, the number of few-shot examples, or performing instruction tuning. Our\nanalysis suggests that work evaluating LLMs with prompting-based methods would\nbenefit from reporting a range of performance across plausible prompt formats,\ninstead of the currently-standard practice of reporting performance on a single\nformat. We also show that format performance only weakly correlates between\nmodels, which puts into question the methodological validity of comparing\nmodels with an arbitrarily chosen, fixed prompt format. 
To facilitate\nsystematic analysis we propose FormatSpread, an algorithm that rapidly\nevaluates a sampled set of plausible prompt formats for a given task, and\nreports the interval of expected performance without accessing model weights.\nFurthermore, we present a suite of analyses that characterize the nature of\nthis sensitivity, including exploring the influence of particular atomic\nperturbations and the internal representation of particular formats.\n","authors":["Melanie Sclar","Yejin Choi","Yulia Tsvetkov","Alane Suhr"],"pdf_url":"https://arxiv.org/pdf/2310.11324v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.09652v2","updated":"2023-10-17T14:59:28Z","published":"2023-05-16T17:53:03Z","title":"The Interpreter Understands Your Meaning: End-to-end Spoken Language\n Understanding Aided by Speech Translation","summary":" End-to-end spoken language understanding (SLU) remains elusive even with\ncurrent large pretrained language models on text and speech, especially in\nmultilingual cases. Machine translation has been established as a powerful\npretraining objective on text as it enables the model to capture high-level\nsemantics of the input utterance and associations between different languages,\nwhich is desired for speech models that work on lower-level acoustic frames.\nMotivated particularly by the task of cross-lingual SLU, we demonstrate that\nthe task of speech translation (ST) is a good means of pretraining speech\nmodels for end-to-end SLU on both intra- and cross-lingual scenarios.\n By introducing ST, our models reach higher performance over baselines on\nmonolingual and multilingual intent classification as well as spoken question\nanswering using SLURP, MINDS-14, and NMSQA benchmarks. To verify the\neffectiveness of our methods, we also create new benchmark datasets from both\nsynthetic and real sources, for speech summarization and low-resource/zero-shot\ntransfer from English to French or Spanish. We further show the value of\npreserving knowledge for the ST pretraining task for better downstream\nperformance, possibly using Bayesian transfer regularizers.\n","authors":["Mutian He","Philip N. Garner"],"pdf_url":"https://arxiv.org/pdf/2305.09652v2.pdf","comment":"16 pages, 3 figures; accepted by Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.07276v2","updated":"2023-10-17T14:55:58Z","published":"2023-10-11T07:57:08Z","title":"BioT5: Enriching Cross-modal Integration in Biology with Chemical\n Knowledge and Natural Language Associations","summary":" Recent advancements in biological research leverage the integration of\nmolecules, proteins, and natural language to enhance drug discovery. However,\ncurrent models exhibit several limitations, such as the generation of invalid\nmolecular SMILES, underutilization of contextual information, and equal\ntreatment of structured and unstructured knowledge. To address these issues, we\npropose $\\mathbf{BioT5}$, a comprehensive pre-training framework that enriches\ncross-modal integration in biology with chemical knowledge and natural language\nassociations. $\\mathbf{BioT5}$ utilizes SELFIES for $100%$ robust molecular\nrepresentations and extracts knowledge from the surrounding context of\nbio-entities in unstructured biological literature. Furthermore,\n$\\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge,\nleading to more effective utilization of information. 
After fine-tuning, BioT5\nshows superior performance across a wide range of tasks, demonstrating its\nstrong capability of capturing underlying relations and properties of\nbio-entities. Our code is available at\n$\\href{https://github.com/QizhiPei/BioT5}{Github}$.\n","authors":["Qizhi Pei","Wei Zhang","Jinhua Zhu","Kehan Wu","Kaiyuan Gao","Lijun Wu","Yingce Xia","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.07276v2.pdf","comment":"Accepted by Empirical Methods in Natural Language Processing 2023\n (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2310.11318v1","updated":"2023-10-17T14:52:33Z","published":"2023-10-17T14:52:33Z","title":"Utilising a Large Language Model to Annotate Subject Metadata: A Case\n Study in an Australian National Research Data Catalogue","summary":" In support of open and reproducible research, there has been a rapidly\nincreasing number of datasets made available for research. As the availability\nof datasets increases, it becomes more important to have quality metadata for\ndiscovering and reusing them. Yet, it is a common issue that datasets often\nlack quality metadata due to limited resources for data curation. Meanwhile,\ntechnologies such as artificial intelligence and large language models (LLMs)\nare progressing rapidly. Recently, systems based on these technologies, such as\nChatGPT, have demonstrated promising capabilities for certain data curation\ntasks. This paper proposes to leverage LLMs for cost-effective annotation of\nsubject metadata through the LLM-based in-context learning. Our method employs\nGPT-3.5 with prompts designed for annotating subject metadata, demonstrating\npromising performance in automatic metadata annotation. However, models based\non in-context learning cannot acquire discipline-specific rules, resulting in\nlower performance in several categories. This limitation arises from the\nlimited contextual information available for subject inference. To the best of\nour knowledge, we are introducing, for the first time, an in-context learning\nmethod that harnesses large language models for automated subject metadata\nannotation.\n","authors":["Shiwei Zhang","Mingfang Wu","Xiuzhen Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.11318v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09520v2","updated":"2023-10-17T14:48:25Z","published":"2023-10-14T07:19:47Z","title":"Reward-Augmented Decoding: Efficient Controlled Text Generation With a\n Unidirectional Reward Model","summary":" While large language models have proven effective in a huge range of\ndownstream applications, they often generate text that is problematic or lacks\na desired attribute. In this paper, we introduce Reward-Augmented Decoding\n(RAD), a text generation procedure that uses a small unidirectional reward\nmodel to encourage a language model to generate text that has certain\nproperties. Specifically, RAD uses the reward model to score generations as\nthey are produced and rescales sampling probabilities to favor high-reward\ntokens. By using a unidirectional reward model, RAD can cache activations from\nprior generation steps to decrease computational overhead. Through experiments\non generating non-toxic and sentiment-controlled text, we demonstrate that RAD\nperforms best among methods that change only the generation procedure and\nmatches the performance of state-of-the-art methods that involve re-training\nthe language model. 
We further validate that RAD is effective on very large\nlanguage models while incurring a minimal computational overhead.\n","authors":["Haikang Deng","Colin Raffel"],"pdf_url":"https://arxiv.org/pdf/2310.09520v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.11171v2","updated":"2023-10-17T14:45:39Z","published":"2023-05-18T17:58:35Z","title":"TrueTeacher: Learning Factual Consistency Evaluation with Large Language\n Models","summary":" Factual consistency evaluation is often conducted using Natural Language\nInference (NLI) models, yet these models exhibit limited success in evaluating\nsummaries. Previous work improved such models with synthetic training data.\nHowever, the data is typically based on perturbed human-written summaries,\nwhich often differ in their characteristics from real model-generated summaries\nand have limited coverage of possible factual errors. Alternatively, large\nlanguage models (LLMs) have recently shown promising results in directly\nevaluating generative tasks, but are too computationally expensive for\npractical use. Motivated by these limitations, we introduce TrueTeacher, a\nmethod for generating synthetic data by annotating diverse model-generated\nsummaries using a LLM. Unlike prior work, TrueTeacher does not rely on\nhuman-written summaries, and is multilingual by nature. Experiments on the TRUE\nbenchmark show that a student model trained using our data, substantially\noutperforms both the state-of-the-art model with similar capacity, and the LLM\nteacher. In a systematic study, we compare TrueTeacher to existing synthetic\ndata generation methods and demonstrate its superiority and robustness to\ndomain-shift. We also show that our method generalizes to multilingual\nscenarios. Lastly, we release our large scale synthetic dataset (1.4M\nexamples), generated using TrueTeacher, and a checkpoint trained on this data.\n","authors":["Zorik Gekhman","Jonathan Herzig","Roee Aharoni","Chen Elkind","Idan Szpektor"],"pdf_url":"https://arxiv.org/pdf/2305.11171v2.pdf","comment":"Accepted as a long paper in EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.15040v2","updated":"2023-10-17T14:37:10Z","published":"2023-05-24T11:27:53Z","title":"Active Learning for Natural Language Generation","summary":" The field of Natural Language Generation (NLG) suffers from a severe shortage\nof labeled data due to the extremely expensive and time-consuming process\ninvolved in manual annotation. A natural approach for coping with this problem\nis active learning (AL), a well-known machine learning technique for improving\nannotation efficiency by selectively choosing the most informative examples to\nlabel. However, while AL has been well-researched in the context of text\nclassification, its application to NLG remains largely unexplored. In this\npaper, we present a first systematic study of active learning for NLG,\nconsidering a diverse set of tasks and multiple leading selection strategies,\nand harnessing a strong instruction-tuned model. Our results indicate that the\nperformance of existing AL strategies is inconsistent, surpassing the baseline\nof random example selection in some cases but not in others. We highlight some\nnotable differences between the classification and generation scenarios, and\nanalyze the selection behaviors of existing AL strategies. 
Our findings\nmotivate exploring novel approaches for applying AL to generation tasks.\n","authors":["Yotam Perlitz","Ariel Gera","Michal Shmueli-Scheuer","Dafna Sheinwald","Noam Slonim","Liat Ein-Dor"],"pdf_url":"https://arxiv.org/pdf/2305.15040v2.pdf","comment":"Accepted to EMNLP2023 as a long paper"},{"id":"http://arxiv.org/abs/2310.11303v1","updated":"2023-10-17T14:27:34Z","published":"2023-10-17T14:27:34Z","title":"QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for\n Zero-Shot Commonsense Question Answering","summary":" Zero-shot commonsense Question-Answering (QA) requires models to reason about\ngeneral situations beyond specific benchmarks. State-of-the-art approaches\nfine-tune language models on QA pairs constructed from CommonSense Knowledge\nBases (CSKBs) to equip the models with more commonsense knowledge in a QA\ncontext. However, current QA synthesis protocols may introduce noise from the\nCSKBs and generate ungrammatical questions and false negative options, which\nimpede the model's ability to generalize. To address these issues, we propose\nQADYNAMICS, a training dynamics-driven framework for QA diagnostics and\nrefinement. Our approach analyzes the training dynamics of each QA pair at both\nthe question level and option level, discarding machine-detectable artifacts by\nremoving uninformative QA pairs and mislabeled or false-negative options.\nExtensive experiments demonstrate the effectiveness of our approach, which\noutperforms all baselines while using only 33% of the synthetic data, even\nincluding LLMs such as ChatGPT. Moreover, expert evaluations confirm that our\nframework significantly improves the quality of QA synthesis. Our codes and\nmodel checkpoints are available at\nhttps://github.com/HKUST-KnowComp/QaDynamics.\n","authors":["Haochen Shi","Weiqi Wang","Tianqing Fang","Baixuan Xu","Wenxuan Ding","Xin Liu","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2310.11303v1.pdf","comment":"Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2302.11713v5","updated":"2023-10-17T14:19:13Z","published":"2023-02-23T00:33:54Z","title":"Can Pre-trained Vision and Language Models Answer Visual\n Information-Seeking Questions?","summary":" Pre-trained vision and language models have demonstrated state-of-the-art\ncapabilities over existing tasks involving images and texts, including visual\nquestion answering. However, it remains unclear whether these models possess\nthe capability to answer questions that are not only querying visual content\nbut knowledge-intensive and information-seeking. In this study, we introduce\nInfoSeek, a visual question answering dataset tailored for information-seeking\nquestions that cannot be answered with only common sense knowledge. Using\nInfoSeek, we analyze various pre-trained visual question answering models and\ngain insights into their characteristics. Our findings reveal that\nstate-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.)\nface challenges in answering visual information-seeking questions, but\nfine-tuning on the InfoSeek dataset elicits models to use fine-grained\nknowledge that was learned during their pre-training. 
Furthermore, we show that\naccurate visual entity recognition can be used to improve performance on\nInfoSeek by retrieving relevant documents, showing a significant space for\nimprovement.\n","authors":["Yang Chen","Hexiang Hu","Yi Luan","Haitian Sun","Soravit Changpinyo","Alan Ritter","Ming-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2302.11713v5.pdf","comment":"EMNLP 2023 (main conference); Our dataset and evaluation is available\n at https://open-vision-language.github.io/infoseek/"},{"id":"http://arxiv.org/abs/2309.12871v3","updated":"2023-10-17T14:08:53Z","published":"2023-09-22T13:52:42Z","title":"AnglE-optimized Text Embeddings","summary":" High-quality text embedding is pivotal in improving semantic textual\nsimilarity (STS) tasks, which are crucial components in Large Language Model\n(LLM) applications. However, a common challenge existing text embedding models\nface is the problem of vanishing gradients, primarily due to their reliance on\nthe cosine function in the optimization objective, which has saturation zones.\nTo address this issue, this paper proposes a novel angle-optimized text\nembedding model called AnglE. The core idea of AnglE is to introduce angle\noptimization in a complex space. This novel approach effectively mitigates the\nadverse effects of the saturation zone in the cosine function, which can impede\ngradient and hinder optimization processes. To set up a comprehensive STS\nevaluation, we experimented on existing short-text STS datasets and a newly\ncollected long-text STS dataset from GitHub Issues. Furthermore, we examine\ndomain-specific STS scenarios with limited labeled data and explore how AnglE\nworks with LLM-annotated data. Extensive experiments were conducted on various\ntasks including short-text STS, long-text STS, and domain-specific STS tasks.\nThe results show that AnglE outperforms the state-of-the-art (SOTA) STS models\nthat ignore the cosine saturation zone. These findings demonstrate the ability\nof AnglE to generate high-quality text embeddings and the usefulness of angle\noptimization in STS.\n","authors":["Xianming Li","Jing Li"],"pdf_url":"https://arxiv.org/pdf/2309.12871v3.pdf","comment":"update llama results"},{"id":"http://arxiv.org/abs/2310.11282v1","updated":"2023-10-17T14:06:06Z","published":"2023-10-17T14:06:06Z","title":"ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency\n by Automatic Task Formation","summary":" We present the submission of the ILLC at the University of Amsterdam to the\nBabyLM challenge (Warstadt et al., 2023), in the strict-small track. Our final\nmodel, ChapGTP, is a masked language model that was trained for 200 epochs,\naided by a novel data augmentation technique called Automatic Task Formation.\nWe discuss in detail the performance of this model on the three evaluation\nsuites: BLiMP, (Super)GLUE, and MSGS. 
Furthermore, we present a wide range of\nmethods that were ultimately not included in the model, but may serve as\ninspiration for training LMs in low-resource settings.\n","authors":["Jaap Jumelet","Michael Hanna","Marianne de Heer Kloots","Anna Langedijk","Charlotte Pouw","Oskar van der Wal"],"pdf_url":"https://arxiv.org/pdf/2310.11282v1.pdf","comment":"Part of the BabyLM challenge at CoNLL"},{"id":"http://arxiv.org/abs/2310.11275v1","updated":"2023-10-17T13:53:57Z","published":"2023-10-17T13:53:57Z","title":"xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization","summary":" Objective: To improve performance of medical entity normalization across many\nlanguages, especially when fewer language resources are available compared to\nEnglish.\n Materials and Methods: We introduce xMEN, a modular system for cross-lingual\nmedical entity normalization, which performs well in both low- and\nhigh-resource scenarios. When synonyms in the target language are scarce for a\ngiven terminology, we leverage English aliases via cross-lingual candidate\ngeneration. For candidate ranking, we incorporate a trainable cross-encoder\nmodel if annotations for the target task are available. We also evaluate\ncross-encoders trained in a weakly supervised manner based on\nmachine-translated datasets from a high resource domain. Our system is publicly\navailable as an extensible Python toolkit.\n Results: xMEN improves the state-of-the-art performance across a wide range\nof multilingual benchmark datasets. Weakly supervised cross-encoders are\neffective when no training data is available for the target task. Through the\ncompatibility of xMEN with the BigBIO framework, it can be easily used with\nexisting and prospective datasets.\n Discussion: Our experiments show the importance of balancing the output of\ngeneral-purpose candidate generators with subsequent trainable re-rankers,\nwhich we achieve through a rank regularization term in the loss function of the\ncross-encoder. However, error analysis reveals that multi-word expressions and\nother complex entities are still challenging.\n Conclusion: xMEN exhibits strong performance for medical entity normalization\nin multiple languages, even when no labeled data and few terminology aliases\nfor the target language are available. Its configuration system and evaluation\nmodules enable reproducible benchmarks. Models and code are available online at\nthe following URL: https://github.com/hpi-dhc/xmen\n","authors":["Florian Borchert","Ignacio Llorca","Roland Roller","Bert Arnrich","Matthieu-P. Schapranow"],"pdf_url":"https://arxiv.org/pdf/2310.11275v1.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.11266v1","updated":"2023-10-17T13:39:26Z","published":"2023-10-17T13:39:26Z","title":"Emulating Human Cognitive Processes for Expert-Level Medical\n Question-Answering with Large Language Models","summary":" In response to the pressing need for advanced clinical problem-solving tools\nin healthcare, we introduce BooksMed, a novel framework based on a Large\nLanguage Model (LLM). BooksMed uniquely emulates human cognitive processes to\ndeliver evidence-based and reliable responses, utilizing the GRADE (Grading of\nRecommendations, Assessment, Development, and Evaluations) framework to\neffectively quantify evidence strength. For clinical decision-making to be\nappropriately assessed, an evaluation metric that is clinically aligned and\nvalidated is required. 
As a solution, we present ExpertMedQA, a multispecialty\nclinical benchmark comprised of open-ended, expert-level clinical questions,\nand validated by a diverse group of medical professionals. By demanding an\nin-depth understanding and critical appraisal of up-to-date clinical\nliterature, ExpertMedQA rigorously evaluates LLM performance. BooksMed\noutperforms existing state-of-the-art models Med-PaLM 2, Almanac, and ChatGPT\nin a variety of medical scenarios. Therefore, a framework that mimics human\ncognitive stages could be a useful tool for providing reliable and\nevidence-based responses to clinical inquiries.\n","authors":["Khushboo Verma","Marina Moore","Stephanie Wottrich","Karla Robles López","Nishant Aggarwal","Zeel Bhatt","Aagamjit Singh","Bradford Unroe","Salah Basheer","Nitish Sachdeva","Prinka Arora","Harmanjeet Kaur","Tanupreet Kaur","Tevon Hood","Anahi Marquez","Tushar Varshney","Nanfu Deng","Azaan Ramani","Pawanraj Ishwara","Maimoona Saeed","Tatiana López Velarde Peña","Bryan Barksdale","Sushovan Guha","Satwant Kumar"],"pdf_url":"https://arxiv.org/pdf/2310.11266v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14107v2","updated":"2023-10-17T13:38:27Z","published":"2023-09-25T13:00:33Z","title":"Wav2vec-based Detection and Severity Level Classification of Dysarthria\n from Speech","summary":" Automatic detection and severity level classification of dysarthria directly\nfrom acoustic speech signals can be used as a tool in medical diagnosis. In\nthis work, the pre-trained wav2vec 2.0 model is studied as a feature extractor\nto build detection and severity level classification systems for dysarthric\nspeech. The experiments were carried out with the popularly used UA-speech\ndatabase. In the detection experiments, the results revealed that the best\nperformance was obtained using the embeddings from the first layer of the\nwav2vec model that yielded an absolute improvement of 1.23% in accuracy\ncompared to the best performing baseline feature (spectrogram). In the studied\nseverity level classification task, the results revealed that the embeddings\nfrom the final layer gave an absolute improvement of 10.62% in accuracy\ncompared to the best baseline features (mel-frequency cepstral coefficients).\n","authors":["Farhad Javanmardi","Saska Tirronen","Manila Kodali","Sudarsana Reddy Kadiri","Paavo Alku"],"pdf_url":"https://arxiv.org/pdf/2309.14107v2.pdf","comment":"copyright 2023 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2309.14080v2","updated":"2023-10-17T13:36:36Z","published":"2023-09-25T12:14:25Z","title":"Analysis and Detection of Pathological Voice using Glottal Source\n Features","summary":" Automatic detection of voice pathology enables objective assessment and\nearlier intervention for the diagnosis. This study provides a systematic\nanalysis of glottal source features and investigates their effectiveness in\nvoice pathology detection. 
Glottal source features are extracted using glottal\nflows estimated with the quasi-closed phase (QCP) glottal inverse filtering\nmethod, using approximate glottal source signals computed with the zero\nfrequency filtering (ZFF) method, and using acoustic voice signals directly. In\naddition, we propose to derive mel-frequency cepstral coefficients (MFCCs) from\nthe glottal source waveforms computed by QCP and ZFF to effectively capture the\nvariations in glottal source spectra of pathological voice. Experiments were\ncarried out using two databases, the Hospital Universitario Principe de\nAsturias (HUPA) database and the Saarbrucken Voice Disorders (SVD) database.\nAnalysis of features revealed that the glottal source contains information that\ndiscriminates normal and pathological voice. Pathology detection experiments\nwere carried out using support vector machine (SVM). From the detection\nexperiments it was observed that the performance achieved with the studied\nglottal source features is comparable or better than that of conventional MFCCs\nand perceptual linear prediction (PLP) features. The best detection performance\nwas achieved when the glottal source features were combined with the\nconventional MFCCs and PLP features, which indicates the complementary nature\nof the features.\n","authors":["Sudarsana Reddy Kadiri","Paavo Alku"],"pdf_url":"https://arxiv.org/pdf/2309.14080v2.pdf","comment":"Copyright 2020 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2305.07609v3","updated":"2023-10-17T13:29:54Z","published":"2023-05-12T16:54:36Z","title":"Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large\n Language Model Recommendation","summary":" The remarkable achievements of Large Language Models (LLMs) have led to the\nemergence of a novel recommendation paradigm -- Recommendation via LLM\n(RecLLM). Nevertheless, it is important to note that LLMs may contain social\nprejudices, and therefore, the fairness of recommendations made by RecLLM\nrequires further investigation. To avoid the potential risks of RecLLM, it is\nimperative to evaluate the fairness of RecLLM with respect to various sensitive\nattributes on the user side. Due to the differences between the RecLLM paradigm\nand the traditional recommendation paradigm, it is problematic to directly use\nthe fairness benchmark of traditional recommendation. To address the dilemma,\nwe propose a novel benchmark called Fairness of Recommendation via LLM\n(FaiRLLM). This benchmark comprises carefully crafted metrics and a dataset\nthat accounts for eight sensitive attributes1 in two recommendation scenarios:\nmusic and movies. By utilizing our FaiRLLM benchmark, we conducted an\nevaluation of ChatGPT and discovered that it still exhibits unfairness to some\nsensitive attributes when generating recommendations. 
Our code and dataset can\nbe found at https://github.com/jizhi-zhang/FaiRLLM.\n","authors":["Jizhi Zhang","Keqin Bao","Yang Zhang","Wenjie Wang","Fuli Feng","Xiangnan He"],"pdf_url":"https://arxiv.org/pdf/2305.07609v3.pdf","comment":"Accepted by Recsys 2023 (Short)"},{"id":"http://arxiv.org/abs/2310.11258v1","updated":"2023-10-17T13:23:18Z","published":"2023-10-17T13:23:18Z","title":"Utilizing Weak Supervision To Generate Indonesian Conservation Dataset","summary":" Weak supervision has emerged as a promising approach for rapid and\nlarge-scale dataset creation in response to the increasing demand for\naccelerated NLP development. By leveraging labeling functions, weak supervision\nallows practitioners to generate datasets quickly by creating learned label\nmodels that produce soft-labeled datasets. This paper aims to show how such an\napproach can be utilized to build an Indonesian NLP dataset from conservation\nnews text. We construct two types of datasets: multi-class classification and\nsentiment classification. We then provide baseline experiments using various\npretrained language models. These baseline results demonstrate test\nperformances of 59.79% accuracy and 55.72% F1-score for sentiment\nclassification, 66.87% F1-score-macro, 71.5% F1-score-micro, and 83.67% ROC-AUC\nfor multi-class classification. Additionally, we release the datasets and\nlabeling functions used in this work for further research and exploration.\n","authors":["Mega Fransiska","Diah Pitaloka"," Saripudin","Satrio Putra","Lintang Sutawika"],"pdf_url":"https://arxiv.org/pdf/2310.11258v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11252v1","updated":"2023-10-17T13:20:16Z","published":"2023-10-17T13:20:16Z","title":"Revealing the Unwritten: Visual Investigation of Beam Search Trees to\n Address Language Model Prompting Challenges","summary":" The growing popularity of generative language models has amplified interest\nin interactive methods to guide model outputs. Prompt refinement is considered\none of the most effective means to influence output among these methods. We\nidentify several challenges associated with prompting large language models,\ncategorized into data- and model-specific, linguistic, and socio-linguistic\nchallenges. A comprehensive examination of model outputs, including runner-up\ncandidates and their corresponding probabilities, is needed to address these\nissues. The beam search tree, the prevalent algorithm to sample model outputs,\ncan inherently supply this information. Consequently, we introduce an\ninteractive visual method for investigating the beam search tree, facilitating\nanalysis of the decisions made by the model during generation. We\nquantitatively show the value of exposing the beam search tree and present five\ndetailed analysis scenarios addressing the identified challenges. Our\nmethodology validates existing results and offers additional insights.\n","authors":["Thilo Spinner","Rebecca Kehlbeck","Rita Sevastjanova","Tobias Stähle","Daniel A. 
Keim","Oliver Deussen","Andreas Spitz","Mennatallah El-Assady"],"pdf_url":"https://arxiv.org/pdf/2310.11252v1.pdf","comment":"9 pages paper, 2 pages references, 7 figures"},{"id":"http://arxiv.org/abs/2310.11248v1","updated":"2023-10-17T13:18:01Z","published":"2023-10-17T13:18:01Z","title":"CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code\n Completion","summary":" Code completion models have made significant progress in recent years, yet\ncurrent popular evaluation datasets, such as HumanEval and MBPP, predominantly\nfocus on code completion tasks within a single file. This over-simplified\nsetting falls short of representing the real-world software development\nscenario where repositories span multiple files with numerous cross-file\ndependencies, and accessing and understanding cross-file context is often\nrequired to complete the code correctly.\n To fill in this gap, we propose CrossCodeEval, a diverse and multilingual\ncode completion benchmark that necessitates an in-depth cross-file contextual\nunderstanding to complete the code accurately. CrossCodeEval is built on a\ndiverse set of real-world, open-sourced, permissively-licensed repositories in\nfour popular programming languages: Python, Java, TypeScript, and C#. To create\nexamples that strictly require cross-file context for accurate completion, we\npropose a straightforward yet efficient static-analysis-based approach to\npinpoint the use of cross-file context within the current file.\n Extensive experiments on state-of-the-art code language models like CodeGen\nand StarCoder demonstrate that CrossCodeEval is extremely challenging when the\nrelevant cross-file context is absent, and we see clear improvements when\nadding these context into the prompt. However, despite such improvements, the\npinnacle of performance remains notably unattained even with the\nhighest-performing model, indicating that CrossCodeEval is also capable of\nassessing model's capability in leveraging extensive context to make better\ncode completion. Finally, we benchmarked various methods in retrieving\ncross-file context, and show that CrossCodeEval can also be used to measure the\ncapability of code retrievers.\n","authors":["Yangruibo Ding","Zijian Wang","Wasi Uddin Ahmad","Hantian Ding","Ming Tan","Nihal Jain","Murali Krishna Ramanathan","Ramesh Nallapati","Parminder Bhatia","Dan Roth","Bing Xiang"],"pdf_url":"https://arxiv.org/pdf/2310.11248v1.pdf","comment":"To appear at NeurIPS 2023 (Datasets and Benchmarks Track)"},{"id":"http://arxiv.org/abs/2310.11244v1","updated":"2023-10-17T13:12:32Z","published":"2023-10-17T13:12:32Z","title":"Entity Matching using Large Language Models","summary":" Entity Matching is the task of deciding whether two entity descriptions refer\nto the same real-world entity. Entity Matching is a central step in most data\nintegration pipelines and an enabler for many e-commerce applications which\nrequire to match products offers from different vendors. State-of-the-art\nentity matching methods often rely on pre-trained language models (PLMs) such\nas BERT or RoBERTa. Two major drawbacks of these models for entity matching are\nthat (i) the models require significant amounts of task-specific training data\nand (ii) the fine-tuned models are not robust concerning out-of-distribution\nentities. In this paper, we investigate using large language models (LLMs) for\nentity matching as a less domain-specific training data reliant and more robust\nalternative to PLM-based matchers. 
Our study covers hosted LLMs, such as GPT3.5\nand GPT4, as well as open source LLMs based on Llama2, which can be run locally.\nWe evaluate these models in a zero-shot scenario as well as a scenario where\ntask-specific training data is available. We compare different prompt designs\nas well as the prompt sensitivity of the models in the zero-shot scenario. We\ninvestigate (i) the selection of in-context demonstrations, (ii) the generation\nof matching rules, as well as (iii) fine-tuning GPT3.5 in the second scenario\nusing the same pool of training data across the different approaches. Our\nexperiments show that GPT4 without any task-specific training data outperforms\nfine-tuned PLMs (RoBERTa and Ditto) on three out of five benchmark datasets,\nreaching F1 scores around 90%. The experiments with in-context learning and\nrule generation show that all models besides GPT4 benefit from these\ntechniques (on average 5.9% and 2.2% F1), while GPT4 does not need such\nadditional guidance in most cases...\n","authors":["Ralph Peeters","Christian Bizer"],"pdf_url":"https://arxiv.org/pdf/2310.11244v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.12053v4","updated":"2023-10-17T13:10:05Z","published":"2023-09-21T13:20:13Z","title":"AceGPT, Localizing Large Language Models in Arabic","summary":" This paper is devoted to the development of a localized Large Language Model\n(LLM) specifically for Arabic, a language imbued with unique cultural\ncharacteristics inadequately addressed by current mainstream models.\nSignificant concerns emerge when addressing cultural sensitivity and local\nvalues. To address this, the paper proposes a comprehensive solution that\nincludes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT)\nutilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside\nReinforcement Learning with AI Feedback (RLAIF) employing a reward model\nattuned to local culture and values. The goal is to cultivate culturally\ncognizant and value-aligned Arabic LLMs capable of accommodating the diverse,\napplication-specific needs of Arabic-speaking communities. Comprehensive\nevaluations reveal that the resulting model, dubbed 'AceGPT', sets the\nstate-of-the-art standard for open Arabic LLMs across various benchmarks,\nincluding the instruction-following benchmark (i.e., Arabic Vicuna-80 and\nArabic AlpacaEval), knowledge benchmark (i.e., Arabic MMLU and EXAMs), and the\nnewly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT\noutperforms Turbo in the popular Vicuna-80 benchmark when evaluated with GPT-4,\ndespite the benchmark's limited scale. Codes, data, and models are in\nhttps://github.com/FreedomIntelligence/AceGPT.\n","authors":["Huang Huang","Fei Yu","Jianqing Zhu","Xuening Sun","Hao Cheng","Dingjie Song","Zhihong Chen","Abdulmohsen Alharthi","Bang An","Juncai He","Ziche Liu","Zhiyi Zhang","Junying Chen","Jianquan Li","Benyou Wang","Lian Zhang","Ruoyu Sun","Xiang Wan","Haizhou Li","Jinchao Xu"],"pdf_url":"https://arxiv.org/pdf/2309.12053v4.pdf","comment":"https://github.com/FreedomIntelligence/AceGPT"},{"id":"http://arxiv.org/abs/2310.11237v1","updated":"2023-10-17T13:06:59Z","published":"2023-10-17T13:06:59Z","title":"Watermarking LLMs with Weight Quantization","summary":" Abuse of large language models reveals high risks as large language models\nare being deployed at an astonishing speed. It is important to protect the\nmodel weights to avoid malicious usage that violates licenses of open-source\nlarge language models. 
This paper proposes a novel watermarking strategy that\nplants watermarks in the quantization process of large language models without\npre-defined triggers during inference. The watermark works when the model is\nused in the fp32 mode and remains hidden when the model is quantized to int8,\nin this way, the users can only inference the model without further supervised\nfine-tuning of the model. We successfully plant the watermark into open-source\nlarge language model weights including GPT-Neo and LLaMA. We hope our proposed\nmethod can provide a potential direction for protecting model weights in the\nera of large language model applications.\n","authors":["Linyang Li","Botian Jiang","Pengyu Wang","Ke Ren","Hang Yan","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2310.11237v1.pdf","comment":"Accepted by Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2310.11227v1","updated":"2023-10-17T12:58:17Z","published":"2023-10-17T12:58:17Z","title":"RealBehavior: A Framework for Faithfully Characterizing Foundation\n Models' Human-like Behavior Mechanisms","summary":" Reports of human-like behaviors in foundation models are growing, with\npsychological theories providing enduring tools to investigate these behaviors.\nHowever, current research tends to directly apply these human-oriented tools\nwithout verifying the faithfulness of their outcomes. In this paper, we\nintroduce a framework, RealBehavior, which is designed to characterize the\nhumanoid behaviors of models faithfully. Beyond simply measuring behaviors, our\nframework assesses the faithfulness of results based on reproducibility,\ninternal and external consistency, and generalizability. Our findings suggest\nthat a simple application of psychological tools cannot faithfully characterize\nall human-like behaviors. Moreover, we discuss the impacts of aligning models\nwith human and social values, arguing for the necessity of diversifying\nalignment objectives to prevent the creation of models with restricted\ncharacteristics.\n","authors":["Enyu Zhou","Rui Zheng","Zhiheng Xi","Songyang Gao","Xiaoran Fan","Zichu Fei","Jingting Ye","Tao Gui","Qi Zhang","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.11227v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.08943v2","updated":"2023-10-17T12:53:58Z","published":"2023-10-13T08:16:27Z","title":"Multi-level Adaptive Contrastive Learning for Knowledge Internalization\n in Dialogue Generation","summary":" Knowledge-grounded dialogue generation aims to mitigate the issue of text\ndegeneration by incorporating external knowledge to supplement the context.\nHowever, the model often fails to internalize this information into responses\nin a human-like manner. Instead, it simply inserts segments of the provided\nknowledge into generic responses. As a result, the generated responses tend to\nbe tedious, incoherent, and in lack of interactivity which means the\ndegeneration problem is still unsolved. In this work, we first find that such\ncopying-style degeneration is primarily due to the weak likelihood objective,\nwhich allows the model to \"cheat\" the objective by merely duplicating knowledge\nsegments in a superficial pattern matching based on overlap. To overcome this\nchallenge, we then propose a Multi-level Adaptive Contrastive Learning (MACL)\nframework that dynamically samples negative examples and subsequently penalizes\ndegeneration behaviors at both the token-level and sequence-level. 
Extensive\nexperiments on the WoW dataset demonstrate the effectiveness of our approach\nacross various pre-trained models.\n","authors":["Chenxu Yang","Zheng Lin","Lanrui Wang","Chong Tian","Liang Pang","Jiangnan Li","Qirong Ho","Yanan Cao","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08943v2.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11220v1","updated":"2023-10-17T12:51:35Z","published":"2023-10-17T12:51:35Z","title":"KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using\n Large Language Models","summary":" While large language models (LLMs) have made considerable advancements in\nunderstanding and generating unstructured text, their application in structured\ndata remains underexplored. Particularly, using LLMs for complex reasoning\ntasks on knowledge graphs (KGs) remains largely untouched. To address this, we\npropose KG-GPT, a multi-purpose framework leveraging LLMs for tasks employing\nKGs. KG-GPT comprises three steps: Sentence Segmentation, Graph Retrieval, and\nInference, each aimed at partitioning sentences, retrieving relevant graph\ncomponents, and deriving logical conclusions, respectively. We evaluate KG-GPT\nusing KG-based fact verification and KGQA benchmarks, with the model showing\ncompetitive and robust performance, even outperforming several fully-supervised\nmodels. Our work, therefore, marks a significant step in unifying structured\nand unstructured data processing within the realm of LLMs.\n","authors":["Jiho Kim","Yeonsu Kwon","Yohan Jo","Edward Choi"],"pdf_url":"https://arxiv.org/pdf/2310.11220v1.pdf","comment":"Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.11207v1","updated":"2023-10-17T12:34:32Z","published":"2023-10-17T12:34:32Z","title":"Can Large Language Models Explain Themselves? A Study of LLM-Generated\n Self-Explanations","summary":" Large language models (LLMs) such as ChatGPT have demonstrated superior\nperformance on a variety of natural language processing (NLP) tasks including\nsentiment analysis, mathematical reasoning and summarization. Furthermore,\nsince these models are instruction-tuned on human conversations to produce\n\"helpful\" responses, they can and often will produce explanations along with\nthe response, which we call self-explanations. For example, when analyzing the\nsentiment of a movie review, the model may output not only the positivity of\nthe sentiment, but also an explanation (e.g., by listing the sentiment-laden\nwords such as \"fantastic\" and \"memorable\" in the review). How good are these\nautomatically generated self-explanations? In this paper, we investigate this\nquestion on the task of sentiment analysis and for feature attribution\nexplanation, one of the most commonly studied settings in the interpretability\nliterature (for pre-ChatGPT models). Specifically, we study different ways to\nelicit the self-explanations, evaluate their faithfulness on a set of\nevaluation metrics, and compare them to traditional explanation methods such as\nocclusion or LIME saliency maps. Through an extensive set of experiments, we\nfind that ChatGPT's self-explanations perform on par with traditional ones, but\nare quite different from them according to various agreement metrics, meanwhile\nbeing much cheaper to produce (as they are generated along with the\nprediction). 
In addition, we identified several interesting characteristics of\nthem, which prompt us to rethink many current model interpretability practices\nin the era of ChatGPT(-like) LLMs.\n","authors":["Shiyuan Huang","Siddarth Mamidanna","Shreedhar Jangam","Yilun Zhou","Leilani H. Gilpin"],"pdf_url":"https://arxiv.org/pdf/2310.11207v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.15452v3","updated":"2023-10-17T12:17:41Z","published":"2023-08-29T17:22:39Z","title":"When Do Program-of-Thoughts Work for Reasoning?","summary":" The reasoning capabilities of Large Language Models (LLMs) play a pivotal\nrole in the realm of embodied artificial intelligence. Although there are\neffective methods like program-of-thought prompting for LLMs which uses\nprogramming language to tackle complex reasoning tasks, the specific impact of\ncode data on the improvement of reasoning capabilities remains under-explored.\nTo address this gap, we propose complexity-impacted reasoning score (CIRS),\nwhich combines structural and logical attributes, to measure the correlation\nbetween code and reasoning abilities. Specifically, we use the abstract syntax\ntree to encode the structural information and calculate logical complexity by\nconsidering the difficulty and the cyclomatic complexity. Through an empirical\nanalysis, we find not all code data of complexity can be learned or understood\nby LLMs. Optimal level of complexity is critical to the improvement of\nreasoning abilities by program-aided prompting. Then we design an\nauto-synthesizing and stratifying algorithm, and apply it to instruction\ngeneration for mathematical reasoning and code data filtering for code\ngeneration tasks. Extensive results demonstrates the effectiveness of our\nproposed approach. Code will be integrated into the EasyInstruct framework at\nhttps://github.com/zjunlp/EasyInstruct.\n","authors":["Zhen Bi","Ningyu Zhang","Yinuo Jiang","Shumin Deng","Guozhou Zheng","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2308.15452v3.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2305.11490v4","updated":"2023-10-17T12:16:03Z","published":"2023-05-19T07:44:39Z","title":"LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and\n Generation","summary":" Following the impressive development of LLMs, vision-language alignment in\nLLMs is actively being researched to enable multimodal reasoning and visual IO.\nThis direction of research is particularly relevant to medical imaging because\nmedical image analysis and generation consist of reasoning based on a\ncombination of visual features and prior knowledge. Many recent works have\nfocused on training adapter networks that serve as an information bridge\nbetween image processing networks and LLMs; but presumably, in order to achieve\nmaximum reasoning potential of LLMs on visual information as well, visual and\nlanguage features should be allowed to interact more freely. This is especially\nimportant in the medical domain because understanding and generating medical\nimages such as chest X-rays (CXR) require not only accurate visual and\nlanguage-based reasoning but also a more intimate mapping between the two\nmodalities. 
Thus, taking inspiration from previous work on the transformer and\nVQ-GAN combination for bidirectional image and text generation, we build upon\nthis approach and develop a method for instruction-tuning an LLM pre-trained\nonly on text to gain vision-language capabilities for medical images.\nSpecifically, we leverage a pretrained LLM's existing question-answering and\ninstruction-following abilities to teach it to understand visual inputs by\ninstructing it to answer questions about image inputs and, symmetrically,\noutput both text and image responses appropriate to a given query by tuning the\nLLM with diverse tasks that encompass image-based text-generation and\ntext-based image-generation. We show that our model, LLM-CXR, trained in this\napproach shows better image-text alignment in both CXR understanding and\ngeneration tasks while being smaller in size compared to previously developed\nmodels that perform a narrower range of tasks. The code is at\nhttps://github.com/hyn2028/llm-cxr.\n","authors":["Suhyeon Lee","Won Jun Kim","Jinho Chang","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2305.11490v4.pdf","comment":"20 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.11191v1","updated":"2023-10-17T12:14:03Z","published":"2023-10-17T12:14:03Z","title":"Medical Text Simplification: Optimizing for Readability with\n Unlikelihood Training and Reranked Beam Search Decoding","summary":" Text simplification has emerged as an increasingly useful application of AI\nfor bridging the communication gap in specialized fields such as medicine,\nwhere the lexicon is often dominated by technical jargon and complex\nconstructs. Despite notable progress, methods in medical simplification\nsometimes result in the generated text having lower quality and diversity. In\nthis work, we explore ways to further improve the readability of text\nsimplification in the medical domain. We propose (1) a new unlikelihood loss\nthat encourages generation of simpler terms and (2) a reranked beam search\ndecoding method that optimizes for simplicity, which achieve better performance\non readability metrics on three datasets. This study's findings offer promising\navenues for improving text simplification in the medical field.\n","authors":["Lorenzo Jaime Yu Flores","Heyuan Huang","Kejian Shi","Sophie Chheang","Arman Cohan"],"pdf_url":"https://arxiv.org/pdf/2310.11191v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2305.13303v2","updated":"2023-10-17T11:58:37Z","published":"2023-05-22T17:58:04Z","title":"Towards Unsupervised Recognition of Semantic Differences in Related\n Documents","summary":" Automatically highlighting words that cause semantic differences between two\ndocuments could be useful for a wide range of applications. We formulate\nrecognizing semantic differences (RSD) as a token-level regression task and\nstudy three unsupervised approaches that rely on a masked language model. To\nassess the approaches, we begin with basic English sentences and gradually move\nto more complex, cross-lingual document pairs. Our results show that an\napproach based on word alignment and sentence-level contrastive learning has a\nrobust correlation to gold labels. However, all unsupervised approaches still\nleave a large margin of improvement. 
Code to reproduce our experiments is\navailable at https://github.com/ZurichNLP/recognizing-semantic-differences\n","authors":["Jannis Vamvas","Rico Sennrich"],"pdf_url":"https://arxiv.org/pdf/2305.13303v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11166v1","updated":"2023-10-17T11:34:50Z","published":"2023-10-17T11:34:50Z","title":"ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text\n Processing","summary":" English and Chinese, known as resource-rich languages, have witnessed the\nstrong development of transformer-based language models for natural language\nprocessing tasks. Although Vietnam has approximately 100M people speaking\nVietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA,\nperformed well on general Vietnamese NLP tasks, including POS tagging and named\nentity recognition. These pre-trained language models are still limited to\nVietnamese social media tasks. In this paper, we present the first monolingual\npre-trained language model for Vietnamese social media texts, ViSoBERT, which\nis pre-trained on a large-scale corpus of high-quality and diverse Vietnamese\nsocial media texts using XLM-R architecture. Moreover, we explored our\npre-trained model on five important natural language downstream tasks on\nVietnamese social media texts: emotion recognition, hate speech detection,\nsentiment analysis, spam reviews detection, and hate speech spans detection.\nOur experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses\nthe previous state-of-the-art models on multiple Vietnamese social media tasks.\nOur ViSoBERT model is\navailable\\footnote{\\url{https://huggingface.co/uitnlp/visobert}} only for\nresearch purposes.\n","authors":["Quoc-Nam Nguyen","Thang Chau Phan","Duc-Vu Nguyen","Kiet Van Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.11166v1.pdf","comment":"Accepted at EMNLP'2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.11163v1","updated":"2023-10-17T11:29:04Z","published":"2023-10-17T11:29:04Z","title":"IMTLab: An Open-Source Platform for Building, Evaluating, and Diagnosing\n Interactive Machine Translation Systems","summary":" We present IMTLab, an open-source end-to-end interactive machine translation\n(IMT) system platform that enables researchers to quickly build IMT systems\nwith state-of-the-art models, perform an end-to-end evaluation, and diagnose\nthe weakness of systems. IMTLab treats the whole interactive translation\nprocess as a task-oriented dialogue with a human-in-the-loop setting, in which\nhuman interventions can be explicitly incorporated to produce high-quality,\nerror-free translations. To this end, a general communication interface is\ndesigned to support the flexible IMT architectures and user policies. Based on\nthe proposed design, we construct a simulated and real interactive environment\nto achieve end-to-end evaluation and leverage the framework to systematically\nevaluate previous IMT systems. 
Our simulated and manual experiments show that\nthe prefix-constrained decoding approach still gains the lowest editing cost in\nthe end-to-end evaluation, while BiTIIMT achieves comparable editing cost with\na better interactive experience.\n","authors":["Xu Huang","Zhirui Zhang","Ruize Gao","Yichao Du","Lemao Liu","Gouping Huang","Shuming Shi","Jiajun Chen","Shujian Huang"],"pdf_url":"https://arxiv.org/pdf/2310.11163v1.pdf","comment":"Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2301.11916v3","updated":"2023-10-17T11:24:33Z","published":"2023-01-27T18:59:01Z","title":"Large Language Models Are Latent Variable Models: Explaining and Finding\n Good Demonstrations for In-Context Learning","summary":" In recent years, pre-trained large language models (LLMs) have demonstrated\nremarkable efficiency in achieving an inference-time few-shot learning\ncapability known as in-context learning. However, existing literature has\nhighlighted the sensitivity of this capability to the selection of few-shot\ndemonstrations. Current understandings of the underlying mechanisms by which\nthis capability arises from regular language model pretraining objectives\nremain disconnected from the real-world LLMs. This study aims to examine the\nin-context learning phenomenon through a Bayesian lens, viewing real-world LLMs\nas latent variable models. On this premise, we propose an algorithm to select\noptimal demonstrations from a set of annotated data with a small LM, and then\ndirectly generalize the selected demonstrations to larger LMs. We demonstrate\nsignificant improvement over baselines, averaged over eight GPT models on eight\nreal-world text classification datasets. We also demonstrate the real-world\nusefulness of our algorithm on GSM8K, a math word problem dataset. Our\nempirical findings support our hypothesis that LLMs implicitly infer a latent\nvariable containing task information.\n","authors":["Xinyi Wang","Wanrong Zhu","Michael Saxon","Mark Steyvers","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2301.11916v3.pdf","comment":"code at:\n https://github.com/WANGXinyiLinda/concept-based-demonstration-selection\n Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.11158v1","updated":"2023-10-17T11:23:32Z","published":"2023-10-17T11:23:32Z","title":"Probing the Creativity of Large Language Models: Can models produce\n divergent semantic association?","summary":" Large language models possess remarkable capacity for processing language,\nbut it remains unclear whether these models can further generate creative\ncontent. The present study aims to investigate the creative thinking of large\nlanguage models through a cognitive perspective. We utilize the divergent\nassociation task (DAT), an objective measurement of creativity that asks models\nto generate unrelated words and calculates the semantic distance between them.\nWe compare the results across different models and decoding strategies. Our\nfindings indicate that: (1) When using the greedy search strategy, GPT-4\noutperforms 96% of humans, while GPT-3.5-turbo exceeds the average human level.\n(2) Stochastic sampling and temperature scaling are effective to obtain higher\nDAT scores for models except GPT-4, but face a trade-off between creativity and\nstability. 
These results imply that advanced large language models have\ndivergent semantic associations, which is a fundamental process underlying\ncreativity.\n","authors":["Honghua Chen","Nai Ding"],"pdf_url":"https://arxiv.org/pdf/2310.11158v1.pdf","comment":"Accepted for publication in Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11146v1","updated":"2023-10-17T10:54:24Z","published":"2023-10-17T10:54:24Z","title":"The Quo Vadis of the Relationship between Language and Large Language\n Models","summary":" In the field of Artificial (General) Intelligence (AI), the several recent\nadvancements in Natural language processing (NLP) activities relying on Large\nLanguage Models (LLMs) have come to encourage the adoption of LLMs as\nscientific models of language. While the terminology employed for the\ncharacterization of LLMs favors their embracing as such, it is not clear that\nthey are in a place to offer insights into the target system they seek to\nrepresent. After identifying the most important theoretical and empirical risks\nbrought about by the adoption of scientific models that lack transparency, we\ndiscuss LLMs relating them to every scientific model's fundamental components:\nthe object, the medium, the meaning and the user. We conclude that, at their\ncurrent stage of development, LLMs hardly offer any explanations for language,\nand then we provide an outlook for more informative future research directions\non this topic.\n","authors":["Evelina Leivada","Vittoria Dentella","Elliot Murphy"],"pdf_url":"https://arxiv.org/pdf/2310.11146v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11141v1","updated":"2023-10-17T10:44:05Z","published":"2023-10-17T10:44:05Z","title":"Long-form Simultaneous Speech Translation: Thesis Proposal","summary":" Simultaneous speech translation (SST) aims to provide real-time translation\nof spoken language, even before the speaker finishes their sentence.\nTraditionally, SST has been addressed primarily by cascaded systems that\ndecompose the task into subtasks, including speech recognition, segmentation,\nand machine translation. However, the advent of deep learning has sparked\nsignificant interest in end-to-end (E2E) systems. Nevertheless, a major\nlimitation of most approaches to E2E SST reported in the current literature is\nthat they assume that the source speech is pre-segmented into sentences, which\nis a significant obstacle for practical, real-world applications. This thesis\nproposal addresses end-to-end simultaneous speech translation, particularly in\nthe long-form setting, i.e., without pre-segmentation. We present a survey of\nthe latest advancements in E2E SST, assess the primary obstacles in SST and its\nrelevance to long-form scenarios, and suggest approaches to tackle these\nchallenges.\n","authors":["Peter Polák"],"pdf_url":"https://arxiv.org/pdf/2310.11141v1.pdf","comment":"IJCNLP-AACL SRW 2023 - camera-ready version"},{"id":"http://arxiv.org/abs/2306.17181v4","updated":"2023-10-17T10:41:12Z","published":"2023-06-19T10:22:12Z","title":"Unsupervised Text Embedding Space Generation Using Generative\n Adversarial Networks for Text Synthesis","summary":" Generative Adversarial Networks (GAN) is a model for data synthesis, which\ncreates plausible data through the competition of generator and discriminator.\nAlthough GAN application to image synthesis is extensively studied, it has\ninherent limitations to natural language generation. 
Because natural language\nis composed of discrete tokens, a generator has difficulty updating its\ngradient through backpropagation; therefore, most text-GAN studies generate\nsentences starting with a random token based on a reward system. Thus, the\ngenerators of previous studies are pre-trained in an autoregressive way before\nadversarial training, causing data memorization such that synthesized sentences\nreproduce the training data. In this paper, we synthesize sentences using a\nframework similar to the original GAN. More specifically, we propose Text\nEmbedding Space Generative Adversarial Networks (TESGAN) which generate\ncontinuous text embedding spaces instead of discrete tokens to solve the\ngradient backpropagation problem. Furthermore, TESGAN conducts unsupervised\nlearning which does not directly refer to the text of the training data to\novercome the data memorization issue. By adopting this novel method, TESGAN can\nsynthesize new sentences, showing the potential of unsupervised learning for\ntext synthesis. We expect to see extended research combining Large Language\nModels with a new perspective of viewing text as a continuous space.\n","authors":["Jun-Min Lee","Tae-Bin Ha"],"pdf_url":"https://arxiv.org/pdf/2306.17181v4.pdf","comment":"NEJLT accepted"},{"id":"http://arxiv.org/abs/2310.09619v2","updated":"2023-10-17T10:10:49Z","published":"2023-10-14T17:00:28Z","title":"An Expression Tree Decoding Strategy for Mathematical Equation\n Generation","summary":" Generating mathematical equations from natural language requires an accurate\nunderstanding of the relations among math expressions. Existing approaches can\nbe broadly categorized into token-level and expression-level generation. The\nformer treats equations as a mathematical language, sequentially generating\nmath tokens. Expression-level methods generate each expression one by one.\nHowever, each expression represents a solving step, and there naturally exist\nparallel or dependent relations between these steps, which are ignored by\ncurrent sequential methods. Therefore, we integrate tree structure into the\nexpression-level generation and advocate an expression tree decoding strategy.\nTo generate a tree with expression as its node, we employ a layer-wise parallel\ndecoding strategy: we decode multiple independent expressions (leaf nodes) in\nparallel at each layer and repeat parallel decoding layer by layer to\nsequentially generate these parent node expressions that depend on others.\nBesides, a bipartite matching algorithm is adopted to align multiple\npredictions with annotations for each layer. Experiments show our method\noutperforms other baselines, especially for those equations with complex\nstructures.\n","authors":["Wenqi Zhang","Yongliang Shen","Qingpeng Nong","Zeqi Tan","Yanna Ma","Weiming Lu"],"pdf_url":"https://arxiv.org/pdf/2310.09619v2.pdf","comment":"Accepted to EMNLP-2023, camera-ready version"},{"id":"http://arxiv.org/abs/2304.13734v2","updated":"2023-10-17T09:34:30Z","published":"2023-04-26T02:49:38Z","title":"The Internal State of an LLM Knows When It's Lying","summary":" While Large Language Models (LLMs) have shown exceptional performance in\nvarious tasks, one of their most prominent drawbacks is generating inaccurate\nor false information with a confident tone. In this paper, we provide evidence\nthat the LLM's internal state can be used to reveal the truthfulness of\nstatements. This includes both statements provided to the LLM, and statements\nthat the LLM itself generates. 
Our approach is to train a classifier that\noutputs the probability that a statement is truthful, based on the hidden layer\nactivations of the LLM as it reads or generates the statement. Experiments\ndemonstrate that given a set of test sentences, of which half are true and half\nfalse, our trained classifier achieves an average of 71\\% to 83\\% accuracy\nlabeling which sentences are true versus false, depending on the LLM base\nmodel. Furthermore, we explore the relationship between our classifier's\nperformance and approaches based on the probability assigned to the sentence by\nthe LLM. We show that while LLM-assigned sentence probability is related to\nsentence truthfulness, this probability is also dependent on sentence length\nand the frequencies of words in the sentence, resulting in our trained\nclassifier providing a more reliable approach to detecting truthfulness,\nhighlighting its potential to enhance the reliability of LLM-generated content\nand its practical applicability in real-world scenarios.\n","authors":["Amos Azaria","Tom Mitchell"],"pdf_url":"https://arxiv.org/pdf/2304.13734v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10333v2","updated":"2023-10-17T09:33:08Z","published":"2023-10-16T12:17:11Z","title":"NLP for Crypto-Asset Regulation: A Roadmap","summary":" In the rapidly evolving field of crypto-assets, white papers are essential\ndocuments for investor guidance, and are now subject to unprecedented content\nrequirements under the EU's Markets in Crypto-Assets Regulation (MiCAR).\nNatural Language Processing can serve as a powerful tool for both analyzing\nthese documents and assisting in regulatory compliance. This paper delivers two\ncontributions to the topic. First, we survey existing applications of textual\nanalysis to unregulated crypto-asset white papers, uncovering a research gap\nthat could be bridged with interdisciplinary collaboration. We then conduct an\nanalysis of the changes introduced by MiCAR, highlighting the opportunities and\nchallenges of integrating NLP within the new regulatory framework. The findings\nset the stage for further research, with the potential to benefit regulators,\ncrypto-asset issuers, and investors.\n","authors":["Carolina Camassa"],"pdf_url":"https://arxiv.org/pdf/2310.10333v2.pdf","comment":"Accepted at NLLP23"},{"id":"http://arxiv.org/abs/2310.11097v1","updated":"2023-10-17T09:27:43Z","published":"2023-10-17T09:27:43Z","title":"Experimenting AI Technologies for Disinformation Combat: the IDMO\n Project","summary":" The Italian Digital Media Observatory (IDMO) project, part of a European\ninitiative, focuses on countering disinformation and fake news. 
This report\noutlines contributions from Rai-CRITS to the project, including: (i) the\ncreation of novel datasets for testing technologies; (ii) development of an\nautomatic model for categorizing Pagella Politica verdicts to facilitate\nbroader analysis; (iii) creation of an automatic model for recognizing textual\nentailment with exceptional accuracy on the FEVER dataset; (iv) assessment using\nGPT-4 to identify textual entailment; and (v) a game to raise awareness about fake\nnews at national events.\n","authors":["Lorenzo Canale","Alberto Messina"],"pdf_url":"https://arxiv.org/pdf/2310.11097v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11085v1","updated":"2023-10-17T09:10:27Z","published":"2023-10-17T09:10:27Z","title":"In-Context Few-Shot Relation Extraction via Pre-Trained Language Models","summary":" Relation extraction aims at inferring structured human knowledge from textual\ndocuments. State-of-the-art methods based on language models commonly have two\nlimitations: (1) they require named entities to be either given as input or\ninferred, which introduces additional noise, and (2) they require human\nannotations of documents. As a remedy, we present a novel framework for\nin-context few-shot relation extraction via pre-trained language models. To the\nbest of our knowledge, we are the first to reformulate the relation extraction\ntask as a tailored in-context few-shot learning paradigm. Thereby, we achieve\ncrucial benefits in that we eliminate the need for both named entity\nrecognition and human annotation of documents. Unlike existing methods based on\nfine-tuning, our framework is flexible in that it can be easily updated for a\nnew set of relations without re-training. We evaluate our framework using\nDocRED, the largest publicly available dataset for document-level relation\nextraction, and demonstrate that our framework achieves state-of-the-art\nperformance. Finally, our framework allows us to identify missing annotations,\nand we thus show that our framework actually performs much better than the\noriginal labels from the development set of DocRED.\n","authors":["Yilmazcan Ozyurt","Stefan Feuerriegel","Ce Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.11085v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.17148v2","updated":"2023-10-17T09:06:16Z","published":"2022-11-30T16:37:42Z","title":"ConvLab-3: A Flexible Dialogue System Toolkit Based on a Unified Data\n Format","summary":" Task-oriented dialogue (TOD) systems function as digital assistants, guiding\nusers through various tasks such as booking flights or finding restaurants.\nExisting toolkits for building TOD systems often fall short in delivering\ncomprehensive arrays of data, models, and experimental environments with a\nuser-friendly experience. We introduce ConvLab-3: a multifaceted dialogue\nsystem toolkit crafted to bridge this gap. Our unified data format simplifies\nthe integration of diverse datasets and models, significantly reducing\ncomplexity and cost for studying generalization and transfer. Enhanced with\nrobust reinforcement learning (RL) tools, featuring a streamlined training\nprocess, in-depth evaluation tools, and a selection of user simulators,\nConvLab-3 supports the rapid development and evaluation of robust dialogue\npolicies. 
Through an extensive study, we demonstrate the efficacy of transfer\nlearning and RL and showcase that ConvLab-3 is not only a powerful tool for\nseasoned researchers but also an accessible platform for newcomers.\n","authors":["Qi Zhu","Christian Geishauser","Hsien-chin Lin","Carel van Niekerk","Baolin Peng","Zheng Zhang","Michael Heck","Nurul Lubis","Dazhen Wan","Xiaochen Zhu","Jianfeng Gao","Milica Gašić","Minlie Huang"],"pdf_url":"https://arxiv.org/pdf/2211.17148v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11081v1","updated":"2023-10-17T09:01:17Z","published":"2023-10-17T09:01:17Z","title":"Understanding writing style in social media with a supervised\n contrastively pre-trained transformer","summary":" Online Social Networks serve as fertile ground for harmful behavior, ranging\nfrom hate speech to the dissemination of disinformation. Malicious actors now\nhave unprecedented freedom to misbehave, leading to severe societal unrest and\ndire consequences, as exemplified by events such as the Capitol assault during\nthe US presidential election and the Antivaxx movement during the COVID-19\npandemic. Understanding online language has become more pressing than ever.\nWhile existing works predominantly focus on content analysis, we aim to shift\nthe focus towards understanding harmful behaviors by relating content to their\nrespective authors. Numerous novel approaches attempt to learn the stylistic\nfeatures of authors in texts, but many of these approaches are constrained by\nsmall datasets or sub-optimal training losses. To overcome these limitations,\nwe introduce the Style Transformer for Authorship Representations (STAR),\ntrained on a large corpus derived from public sources of 4.5 x 10^6 authored\ntexts involving 70k heterogeneous authors. Our model leverages Supervised\nContrastive Loss to teach the model to minimize the distance between texts\nauthored by the same individual. This author pretext pre-training task yields\ncompetitive performance at zero-shot with PAN challenges on attribution and\nclustering. Additionally, we attain promising results on PAN verification\nchallenges using a single dense layer, with our model serving as an embedding\nencoder. Finally, we present results from our test partition on Reddit. Using a\nsupport base of 8 documents of 512 tokens, we can discern authors from sets of\nup to 1616 authors with at least 80\\% accuracy. We share our pre-trained model\nat huggingface (https://huggingface.co/AIDA-UPM/star) and our code is available\nat (https://github.com/jahuerta92/star)\n","authors":["Javier Huertas-Tato","Alejandro Martin","David Camacho"],"pdf_url":"https://arxiv.org/pdf/2310.11081v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11079v1","updated":"2023-10-17T08:56:04Z","published":"2023-10-17T08:56:04Z","title":"Learning from Red Teaming: Gender Bias Provocation and Mitigation in\n Large Language Models","summary":" Recently, researchers have made considerable improvements in dialogue systems\nwith the progress of large language models (LLMs) such as ChatGPT and GPT-4.\nThese LLM-based chatbots encode the potential biases while retaining\ndisparities that can harm humans during interactions. The traditional biases\ninvestigation methods often rely on human-written test cases. However, these\ntest cases are usually expensive and limited. In this work, we propose a\nfirst-of-its-kind method that automatically generates test cases to detect\nLLMs' potential gender bias. 
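The STAR abstract above trains its authorship encoder with a Supervised Contrastive Loss that pulls together texts written by the same author. A small numpy sketch of the generic supervised contrastive objective follows; it uses the standard formulation and is not claimed to match the paper's exact implementation:

```python
import numpy as np

def supervised_contrastive_loss(embeddings, author_ids, temperature=0.07):
    """Mean SupCon loss over anchors that have at least one positive.

    embeddings: (n, d) text embeddings; author_ids: (n,) author label per text.
    """
    z = np.asarray(embeddings, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    author_ids = np.asarray(author_ids)
    sim = z @ z.T / temperature                       # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = author_ids[:, None] == author_ids[None, :]
    np.fill_diagonal(positives, False)
    losses = []
    for i in range(len(z)):
        pos = np.where(positives[i])[0]
        if len(pos):                                  # average over same-author texts
            losses.append(-log_prob[i, pos].mean())
    return float(np.mean(losses))
```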
We apply our method to three well-known LLMs and\nfind that the generated test cases effectively identify the presence of biases.\nTo address the biases identified, we propose a mitigation strategy that uses\nthe generated test cases as demonstrations for in-context learning to\ncircumvent the need for parameter fine-tuning. The experimental results show\nthat LLMs generate fairer responses with the proposed approach.\n","authors":["Hsuan Su","Cheng-Chu Cheng","Hua Farn","Shachi H Kumar","Saurav Sahay","Shang-Tse Chen","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2310.11079v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11069v1","updated":"2023-10-17T08:33:02Z","published":"2023-10-17T08:33:02Z","title":"VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System","summary":" Arabic is a complex language with many varieties and dialects spoken by over\n450 million people around the world. Due to the linguistic diversity and\nvariations, it is challenging to build a robust and generalized ASR system for\nArabic. In this work, we address this gap by developing and demoing a system,\ndubbed VoxArabica, for dialect identification (DID) as well as automatic speech\nrecognition (ASR) of Arabic. We train a wide range of models such as HuBERT\n(DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR\ntasks. Our DID models are trained to identify 17 different dialects in addition\nto MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.\nAdditionally, for the remaining dialects in ASR, we provide the option to\nchoose various models such as Whisper and MMS in a zero-shot setting. We\nintegrate these models into a single web interface with diverse features such\nas audio recording, file upload, model selection, and the option to raise flags\nfor incorrect outputs. Overall, we believe VoxArabica will be useful for a wide\nrange of audiences concerned with Arabic research. Our system is currently\nrunning at https://cdce-206-12-100-168.ngrok.io/.\n","authors":["Abdul Waheed","Bashar Talafha","Peter Suvellin","Abdelrahman Elmadney","Muhammad Abdul-Mageed"],"pdf_url":"https://arxiv.org/pdf/2310.11069v1.pdf","comment":"Accepted at ArabicNLP conference co-located with EMNLP'23"},{"id":"http://arxiv.org/abs/2305.14342v3","updated":"2023-10-17T07:44:16Z","published":"2023-05-23T17:59:21Z","title":"Sophia: A Scalable Stochastic Second-order Optimizer for Language Model\n Pre-training","summary":" Given the massive cost of language model pre-training, a non-trivial\nimprovement of the optimization algorithm would lead to a material reduction in\nthe time and cost of training. Adam and its variants have been state-of-the-art\nfor years, and more sophisticated second-order (Hessian-based) optimizers often\nincur too much per-step overhead. In this paper, we propose Sophia,\nSecond-order Clipped Stochastic Optimization, a simple scalable second-order\noptimizer that uses a light-weight estimate of the diagonal Hessian as the\npre-conditioner. The update is the moving average of the gradients divided by\nthe moving average of the estimated Hessian, followed by element-wise clipping.\nThe clipping controls the worst-case update size and tames the negative impact\nof non-convexity and rapid change of Hessian along the trajectory. Sophia only\nestimates the diagonal Hessian every handful of iterations, which has\nnegligible average per-step time and memory overhead. 
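As a rough illustration of the Sophia update just described (an exponential moving average of gradients divided element-wise by a moving average of a diagonal Hessian estimate, followed by element-wise clipping), here is a simplified single-tensor sketch; the Hessian estimator itself is abstracted away and the hyperparameter values are placeholders rather than the paper's settings:

```python
import numpy as np

def sophia_like_step(param, grad, m, h, hess_estimate=None,
                     lr=1e-4, beta1=0.96, beta2=0.99, rho=0.04, eps=1e-12):
    """One simplified Sophia-style update on a numpy parameter tensor.

    m, h: running averages of the gradient and of a diagonal Hessian estimate.
    hess_estimate: fresh diagonal Hessian estimate, supplied only every k steps.
    """
    m = beta1 * m + (1 - beta1) * grad           # EMA of gradients
    if hess_estimate is not None:                # Hessian refreshed infrequently
        h = beta2 * h + (1 - beta2) * hess_estimate
    update = np.clip(m / np.maximum(h, eps), -rho, rho)   # precondition, then clip
    return param - lr * update, m, h
```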
On language modeling with\nGPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up\ncompared to Adam in the number of steps, total compute, and wall-clock time,\nachieving the same perplexity with 50% fewer steps, less total compute, and\nreduced wall-clock time. Theoretically, we show that Sophia, in a much\nsimplified setting, adapts to the heterogeneous curvatures in different\nparameter dimensions, and thus has a run-time bound that does not depend on the\ncondition number of the loss.\n","authors":["Hong Liu","Zhiyuan Li","David Hall","Percy Liang","Tengyu Ma"],"pdf_url":"https://arxiv.org/pdf/2305.14342v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11053v1","updated":"2023-10-17T07:42:40Z","published":"2023-10-17T07:42:40Z","title":"Denevil: Towards Deciphering and Navigating the Ethical Values of Large\n Language Models via Instruction Learning","summary":" Large Language Models (LLMs) have made unprecedented breakthroughs, yet their\nincreasing integration into everyday life might raise societal risks due to\ngenerated unethical content. Despite extensive study on specific issues like\nbias, the intrinsic values of LLMs remain largely unexplored from a moral\nphilosophy perspective. This work delves into ethical values utilizing Moral\nFoundation Theory. Moving beyond conventional discriminative evaluations with\npoor reliability, we propose DeNEVIL, a novel prompt generation algorithm\ntailored to dynamically exploit LLMs' value vulnerabilities and elicit the\nviolation of ethics in a generative manner, revealing their underlying value\ninclinations. On such a basis, we construct MoralPrompt, a high-quality dataset\ncomprising 2,397 prompts covering 500+ value principles, and then benchmark the\nintrinsic values across a spectrum of LLMs. We discovered that most models are\nessentially misaligned, necessitating further ethical value alignment. In\nresponse, we develop VILMO, an in-context alignment method that substantially\nenhances the value compliance of LLM outputs by learning to generate\nappropriate value instructions, outperforming existing competitors. Our methods\nare suitable for black-box and open-source models, offering a promising initial\nstep in studying the ethical values of LLMs.\n","authors":["Shitong Duan","Xiaoyuan Yi","Peng Zhang","Tun Lu","Xing Xie","Ning Gu"],"pdf_url":"https://arxiv.org/pdf/2310.11053v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11049v1","updated":"2023-10-17T07:35:11Z","published":"2023-10-17T07:35:11Z","title":"Nonet at SemEval-2023 Task 6: Methodologies for Legal Evaluation","summary":" This paper describes our submission to the SemEval-2023 for Task 6 on\nLegalEval: Understanding Legal Texts. Our submission concentrated on three\nsubtasks: Legal Named Entity Recognition (L-NER) for Task-B, Legal Judgment\nPrediction (LJP) for Task-C1, and Court Judgment Prediction with Explanation\n(CJPE) for Task-C2. We conducted various experiments on these subtasks and\npresented the results in detail, including data statistics and methodology. It\nis worth noting that legal tasks, such as those tackled in this research, have\nbeen gaining importance due to the increasing need to automate legal analysis\nand support. 
Our team obtained competitive rankings of 15$^{th}$, 11$^{th}$,\nand 1$^{st}$ in Task-B, Task-C1, and Task-C2, respectively, as reported on the\nleaderboard.\n","authors":["Shubham Kumar Nigam","Aniket Deroy","Noel Shallum","Ayush Kumar Mishra","Anup Roy","Shubham Kumar Mishra","Arnab Bhattacharya","Saptarshi Ghosh","Kripabandhu Ghosh"],"pdf_url":"https://arxiv.org/pdf/2310.11049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2105.01331v3","updated":"2023-10-17T07:30:40Z","published":"2021-05-04T07:27:42Z","title":"BLM-17m: A Large-Scale Dataset for Black Lives Matter Topic Detection on\n Twitter","summary":" Protection of human rights is one of the most important problems of our\nworld. In this paper, our aim is to provide a dataset which covers one of the\nmost significant human rights contradiction in recent months affected the whole\nworld, George Floyd incident. We propose a labeled dataset for topic detection\nthat contains 17 million tweets. These Tweets are collected from 25 May 2020 to\n21 August 2020 that covers 89 days from start of this incident. We labeled the\ndataset by monitoring most trending news topics from global and local\nnewspapers. Apart from that, we present two baselines, TF-IDF and LDA. We\nevaluated the results of these two methods with three different k values for\nmetrics of precision, recall and f1-score. The collected dataset is available\nat https://github.com/MeysamAsgariC/BLMT.\n","authors":["Hasan Kemik","Nusret Özateş","Meysam Asgari-Chenaghlu","Yang Li","Erik Cambria"],"pdf_url":"https://arxiv.org/pdf/2105.01331v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.13780v3","updated":"2023-10-17T07:21:05Z","published":"2023-03-24T03:35:21Z","title":"Towards Making the Most of ChatGPT for Machine Translation","summary":" ChatGPT shows remarkable capabilities for machine translation (MT). Several\nprior studies have shown that it achieves comparable results to commercial\nsystems for high-resource languages, but lags behind in complex tasks, e.g.,\nlow-resource and distant-language-pairs translation. However, they usually\nadopt simple prompts which can not fully elicit the capability of ChatGPT. In\nthis paper, we aim to further mine ChatGPT's translation ability by revisiting\nseveral aspects: temperature, task information, and domain information, and\ncorrespondingly propose an optimal temperature setting and two (simple but\neffective) prompts: Task-Specific Prompts (TSP) and Domain-Specific Prompts\n(DSP). We show that: 1) The performance of ChatGPT depends largely on\ntemperature, and a lower temperature usually can achieve better performance; 2)\nEmphasizing the task information can further improve ChatGPT's performance,\nparticularly in complex MT tasks; 3) Introducing domain information can elicit\nChatGPT's generalization ability and improve its performance in the specific\ndomain; 4) ChatGPT tends to generate hallucinations for non-English-centric MT\ntasks, which can be partially addressed by our proposed prompts but still need\nto be highlighted for the MT/NLP community. 
We also explore the effects of\nadvanced in-context learning strategies and find a (negative but interesting)\nobservation: the powerful chain-of-thought prompt leads to word-by-word\ntranslation behavior, thus bringing significant translation degradation.\n","authors":["Keqin Peng","Liang Ding","Qihuang Zhong","Li Shen","Xuebo Liu","Min Zhang","Yuanxin Ouyang","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2303.13780v3.pdf","comment":"EMNLP 2023 (findings)"},{"id":"http://arxiv.org/abs/2310.11035v1","updated":"2023-10-17T07:02:26Z","published":"2023-10-17T07:02:26Z","title":"Lyricist-Singer Entropy Affects Lyric-Lyricist Classification\n Performance","summary":" Although lyrics represent an essential component of music, few music\ninformation processing studies have been conducted on the characteristics of\nlyricists. Because these characteristics may be valuable for musical\napplications, such as recommendations, they warrant further study. We\nconsidered a potential method that extracts features representing the\ncharacteristics of lyricists from lyrics. Because these features must be\nidentified prior to extraction, we focused on lyricists with easily\nidentifiable features. We believe that it is desirable for singers to perform\nunique songs that share certain characteristics specific to the singer.\nAccordingly, we hypothesized that lyricists account for the unique\ncharacteristics of the singers they write lyrics for. In other words,\nlyric-lyricist classification performance or the ease of capturing the features\nof a lyricist from the lyrics may depend on the variety of singers. In this\nstudy, we observed a relationship between lyricist-singer entropy or the\nvariety of singers associated with a single lyricist and lyric-lyricist\nclassification performance. As an example, the lyricist-singer entropy is\nminimal when the lyricist writes lyrics for only one singer. In our\nexperiments, we grouped lyricists among five groups in terms of lyricist-singer\nentropy and assessed the lyric-lyricist classification performance within each\ngroup. Consequently, the best F1 score was obtained for the group with the\nlowest lyricist-singer entropy. Our results suggest that further analyses of\nthe features contributing to lyric-lyricist classification performance on the\nlowest lyricist-singer entropy group may improve the feature extraction task\nfor lyricists.\n","authors":["Mitsuki Morita","Masato Kikuchi","Tadachika Ozono"],"pdf_url":"https://arxiv.org/pdf/2310.11035v1.pdf","comment":"The 10th International Conference on Advanced Informatics: Concepts,\n Theory and Applications (ICAICTA 2023)"},{"id":"http://arxiv.org/abs/2310.11026v1","updated":"2023-10-17T06:53:00Z","published":"2023-10-17T06:53:00Z","title":"Exploring Automatic Evaluation Methods based on a Decoder-based LLM for\n Text Generation","summary":" Automatic evaluation of text generation is essential for improving the\naccuracy of generation tasks. In light of the current trend towards\nincreasingly larger decoder-based language models, we investigate automatic\nevaluation methods based on such models for text generation. This paper\ncompares various methods, including tuning with encoder-based models and large\nlanguage models under equal conditions, on two different tasks, machine\ntranslation evaluation and semantic textual similarity, in two languages,\nJapanese and English. Experimental results show that compared to the tuned\nencoder-based models, the tuned decoder-based models perform poorly. 
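The lyricist-singer entropy used in the abstract above is, on the usual reading, the Shannon entropy of the distribution of singers a lyricist writes for; a small sketch under that assumption (the function name and example data are illustrative only):

```python
import math
from collections import Counter

def lyricist_singer_entropy(singers):
    """Shannon entropy (bits) of the singer distribution for one lyricist."""
    counts = Counter(singers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A lyricist who writes for a single singer has minimal entropy.
print(lyricist_singer_entropy(["singer_a"] * 10))                     # minimal: 0 bits
print(lyricist_singer_entropy(["singer_a", "singer_b", "singer_c"]))  # ~1.585 bits
```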
The\nanalysis of the causes for this suggests that the decoder-based models focus on\nsurface word sequences and do not capture meaning. It is also revealed that\nin-context learning of very large decoder-based models such as ChatGPT makes it\ndifficult to identify fine-grained semantic differences.\n","authors":["Tomohito Kasahara","Daisuke Kawahara"],"pdf_url":"https://arxiv.org/pdf/2310.11026v1.pdf","comment":"Accepted to IJCNLP-AACL 2023 SRW"},{"id":"http://arxiv.org/abs/2307.03109v8","updated":"2023-10-17T06:28:04Z","published":"2023-07-06T16:28:35Z","title":"A Survey on Evaluation of Large Language Models","summary":" Large language models (LLMs) are gaining increasing popularity in both\nacademia and industry, owing to their unprecedented performance in various\napplications. As LLMs continue to play a vital role in both research and daily\nuse, their evaluation becomes increasingly critical, not only at the task\nlevel, but also at the society level for better understanding of their\npotential risks. Over the past years, significant efforts have been made to\nexamine LLMs from various perspectives. This paper presents a comprehensive\nreview of these evaluation methods for LLMs, focusing on three key dimensions:\nwhat to evaluate, where to evaluate, and how to evaluate. Firstly, we provide\nan overview from the perspective of evaluation tasks, encompassing general\nnatural language processing tasks, reasoning, medical usage, ethics,\neducations, natural and social sciences, agent applications, and other areas.\nSecondly, we answer the `where' and `how' questions by diving into the\nevaluation methods and benchmarks, which serve as crucial components in\nassessing performance of LLMs. Then, we summarize the success and failure cases\nof LLMs in different tasks. Finally, we shed light on several future challenges\nthat lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to\nresearchers in the realm of LLMs evaluation, thereby aiding the development of\nmore proficient LLMs. Our key point is that evaluation should be treated as an\nessential discipline to better assist the development of LLMs. We consistently\nmaintain the related open-source materials at:\nhttps://github.com/MLGroupJLU/LLM-eval-survey.\n","authors":["Yupeng Chang","Xu Wang","Jindong Wang","Yuan Wu","Linyi Yang","Kaijie Zhu","Hao Chen","Xiaoyuan Yi","Cunxiang Wang","Yidong Wang","Wei Ye","Yue Zhang","Yi Chang","Philip S. Yu","Qiang Yang","Xing Xie"],"pdf_url":"https://arxiv.org/pdf/2307.03109v8.pdf","comment":"31 pages; a major update to include more recent works;\n https://llm-eval.github.io/"},{"id":"http://arxiv.org/abs/2310.11016v1","updated":"2023-10-17T06:08:55Z","published":"2023-10-17T06:08:55Z","title":"Reading Order Matters: Information Extraction from Visually-rich\n Documents by Token Path Prediction","summary":" Recent advances in multimodal pre-trained models have significantly improved\ninformation extraction from visually-rich documents (VrDs), in which named\nentity recognition (NER) is treated as a sequence-labeling task of predicting\nthe BIO entity tags for tokens, following the typical setting of NLP. However,\nBIO-tagging scheme relies on the correct order of model inputs, which is not\nguaranteed in real-world NER on scanned VrDs where text are recognized and\narranged by OCR systems. Such reading order issue hinders the accurate marking\nof entities by BIO-tagging scheme, making it impossible for sequence-labeling\nmethods to predict correct named entities. 
To address the reading order issue,\nwe introduce Token Path Prediction (TPP), a simple prediction head to predict\nentity mentions as token sequences within documents. As an alternative to token\nclassification, TPP models the document layout as a complete directed graph of\ntokens, and predicts token paths within the graph as entities. For better\nevaluation of VrD-NER systems, we also propose two revised benchmark datasets\nof NER on scanned documents which can reflect real-world scenarios. Experimental\nresults demonstrate the effectiveness of our method, and suggest its potential\nto be a universal solution to various information extraction tasks on\ndocuments.\n","authors":["Chong Zhang","Ya Guo","Yi Tu","Huan Chen","Jinyang Tang","Huijia Zhu","Qi Zhang","Tao Gui"],"pdf_url":"https://arxiv.org/pdf/2310.11016v1.pdf","comment":"Accepted as a long paper in the main conference of EMNLP 2023"},{"id":"http://arxiv.org/abs/2308.10335v4","updated":"2023-10-17T05:48:29Z","published":"2023-08-20T18:36:28Z","title":"Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability\n of Large Language Model Code Generation","summary":" Recently, large language models (LLMs) have shown extraordinary ability\nin understanding natural language and generating programming code. It has been\na common practice of software engineers to consult LLMs when encountering\ncoding questions. Although efforts have been made to avoid syntax errors and\nalign the code with the intended semantics, the reliability and robustness of\ncode generation from LLMs have not yet been thoroughly studied. Executable\ncode is not equivalent to reliable and robust code, especially\nin the context of real-world software development. The misuse of APIs in the\ngenerated code could lead to severe problems, such as resource leaks and program\ncrashes. To make things worse, the users of LLM code generation services are\noften the developers most vulnerable to this seemingly correct code:\nnovice developers who are not familiar with the APIs for which\nLLMs generate code. Therefore, they can hardly spot the misuse in\nthe code generated by LLMs, which further facilitates incorrect code\nbeing applied in real-world software. Existing code evaluation benchmarks and datasets\nfocus on crafting small tasks such as programming questions in coding\ninterviews, which, however, deviate from the problems developers would ask an\nLLM about for real-world coding help. To fill the missing piece, in this work, we\npropose a dataset RobustAPI for evaluating the reliability and robustness of\ncode generated by LLMs. We collect 1208 coding questions from StackOverflow on\n24 representative Java APIs. We summarize the common misuse patterns of these\nAPIs and evaluate them on current popular LLMs. The evaluation results show that\neven for GPT-4, 62% of the generated code contains API misuses, which would cause\nunexpected consequences if the code is introduced into real-world software.\n","authors":["Li Zhong","Zilong Wang"],"pdf_url":"https://arxiv.org/pdf/2308.10335v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09680v2","updated":"2023-10-17T05:44:03Z","published":"2023-10-14T23:16:05Z","title":"Improved Contextual Recognition In Automatic Speech Recognition Systems\n By Semantic Lattice Rescoring","summary":" Automatic Speech Recognition (ASR) has witnessed a profound research\ninterest. 
Recent breakthroughs have given ASR systems different prospects such\nas faithfully transcribing spoken language, which is a pivotal advancement in\nbuilding conversational agents. However, there is still an imminent challenge\nof accurately discerning context-dependent words and phrases. In this work, we\npropose a novel approach for enhancing contextual recognition within ASR\nsystems via semantic lattice processing leveraging the power of deep learning\nmodels in accurately delivering spot-on transcriptions across a wide variety of\nvocabularies and speaking styles. Our solution consists of using Hidden Markov\nModels and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks\n(DNN) models integrating both language and acoustic modeling for better\naccuracy. We infused our network with the use of a transformer-based model to\nproperly rescore the word lattice achieving remarkable capabilities with a\npalpable reduction in Word Error Rate (WER). We demonstrate the effectiveness\nof our proposed framework on the LibriSpeech dataset with empirical analyses.\n","authors":["Ankitha Sudarshan","Vinay Samuel","Parth Patwa","Ibtihel Amara","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2310.09680v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2209.01250,\n arXiv:2301.06735 by other authors"},{"id":"http://arxiv.org/abs/2310.11003v1","updated":"2023-10-17T05:10:39Z","published":"2023-10-17T05:10:39Z","title":"Correction Focused Language Model Training for Speech Recognition","summary":" Language models (LMs) have been commonly adopted to boost the performance of\nautomatic speech recognition (ASR) particularly in domain adaptation tasks.\nConventional way of LM training treats all the words in corpora equally,\nresulting in suboptimal improvements in ASR performance. In this work, we\nintroduce a novel correction focused LM training approach which aims to\nprioritize ASR fallible words. The word-level ASR fallibility score,\nrepresenting the likelihood of ASR mis-recognition, is defined and shaped as a\nprior word distribution to guide the LM training. To enable correction focused\ntraining with text-only corpora, large language models (LLMs) are employed as\nfallibility score predictors and text generators through multi-task\nfine-tuning. Experimental results for domain adaptation tasks demonstrate the\neffectiveness of our proposed method. Compared with conventional LMs,\ncorrection focused training achieves up to relatively 5.5% word error rate\n(WER) reduction in sufficient text scenarios. In insufficient text scenarios,\nLM training with LLM-generated text achieves up to relatively 13% WER\nreduction, while correction focused training further obtains up to relatively\n6% WER reduction.\n","authors":["Yingyi Ma","Zhe Liu","Ozlem Kalinli"],"pdf_url":"https://arxiv.org/pdf/2310.11003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10819v2","updated":"2023-10-17T04:56:57Z","published":"2023-05-18T08:57:17Z","title":"CLEME: Debiasing Multi-reference Evaluation for Grammatical Error\n Correction","summary":" Evaluating the performance of Grammatical Error Correction (GEC) systems is a\nchallenging task due to its subjectivity. Designing an evaluation metric that\nis as objective as possible is crucial to the development of GEC task. However,\nmainstream evaluation metrics, i.e., reference-based metrics, introduce bias\ninto the multi-reference evaluation by extracting edits without considering the\npresence of multiple references. 
To overcome this issue, we propose Chunk-LEvel\nMulti-reference Evaluation (CLEME), designed to evaluate GEC systems in the\nmulti-reference evaluation setting. CLEME builds chunk sequences with\nconsistent boundaries for the source, the hypothesis and references, thus\neliminating the bias caused by inconsistent edit boundaries. Furthermore, we\nobserve the consistent boundary could also act as the boundary of grammatical\nerrors, based on which the F$_{0.5}$ score is then computed following the\ncorrection independence assumption. We conduct experiments on six English\nreference sets based on the CoNLL-2014 shared task. Extensive experiments and\ndetailed analyses demonstrate the correctness of our discovery and the\neffectiveness of CLEME. Further analysis reveals that CLEME is robust to\nevaluate GEC systems across reference sets with varying numbers of references\nand annotation style.\n","authors":["Jingheng Ye","Yinghui Li","Qingyu Zhou","Yangning Li","Shirong Ma","Hai-Tao Zheng","Ying Shen"],"pdf_url":"https://arxiv.org/pdf/2305.10819v2.pdf","comment":"Accepted as an EMNLP 2023 main paper"},{"id":"http://arxiv.org/abs/2310.10981v1","updated":"2023-10-17T04:03:00Z","published":"2023-10-17T04:03:00Z","title":"Instructive Dialogue Summarization with Query Aggregations","summary":" Conventional dialogue summarization methods directly generate summaries and\ndo not consider user's specific interests. This poses challenges in cases where\nthe users are more focused on particular topics or aspects. With the\nadvancement of instruction-finetuned language models, we introduce\ninstruction-tuning to dialogues to expand the capability set of dialogue\nsummarization models. To overcome the scarcity of instructive dialogue\nsummarization data, we propose a three-step approach to synthesize high-quality\nquery-based summarization triples. This process involves summary-anchored query\ngeneration, query filtering, and query-based summary generation. By training a\nunified model called InstructDS (Instructive Dialogue Summarization) on three\nsummarization datasets with multi-purpose instructive triples, we expand the\ncapability of dialogue summarization models. We evaluate our method on four\ndatasets, including dialogue summarization and dialogue reading comprehension.\nExperimental results show that our approach outperforms the state-of-the-art\nmodels and even models with larger sizes. Additionally, our model exhibits\nhigher generalizability and faithfulness, as confirmed by human subjective\nevaluations.\n","authors":["Bin Wang","Zhengyuan Liu","Nancy F. Chen"],"pdf_url":"https://arxiv.org/pdf/2310.10981v1.pdf","comment":"Accept to EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.09909v2","updated":"2023-10-17T03:41:09Z","published":"2023-10-15T18:32:27Z","title":"Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for\n Multimodal Medical Diagnosis","summary":" Driven by the large foundation models, the development of artificial\nintelligence has witnessed tremendous progress lately, leading to a surge of\ngeneral interest from the public. In this study, we aim to assess the\nperformance of OpenAI's newest model, GPT-4V(ision), specifically in the realm\nof multimodal medical diagnosis. 
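The CLEME abstract above scores systems with F_{0.5} computed over chunk-level edits. For reference, a generic F_beta computation from edit-level true-positive, false-positive, and false-negative counts is sketched below; the chunking step itself is not shown, and the example counts are made up:

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Generic F_beta from edit counts; beta < 1 weights precision over recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 30 correct edits, 10 spurious edits, 20 missed edits.
print(f_beta(tp=30, fp=10, fn=20))  # precision 0.75, recall 0.60 -> F0.5 ~= 0.714
```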
Our evaluation encompasses 17 human body\nsystems, including Central Nervous System, Head and Neck, Cardiac, Chest,\nHematology, Hepatobiliary, Gastrointestinal, Urogenital, Gynecology,\nObstetrics, Breast, Musculoskeletal, Spine, Vascular, Oncology, Trauma,\nPediatrics, with images taken from 8 modalities used in daily clinic routine,\ne.g., X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI),\nPositron Emission Tomography (PET), Digital Subtraction Angiography (DSA),\nMammography, Ultrasound, and Pathology. We probe GPT-4V's ability on\nmultiple clinical tasks with or without patient history provided, including\nimaging modality and anatomy recognition, disease diagnosis, report generation,\nand disease localisation.\n Our observation shows that, while GPT-4V demonstrates proficiency in\ndistinguishing between medical image modalities and anatomy, it faces\nsignificant challenges in disease diagnosis and generating comprehensive\nreports. These findings underscore that while large multimodal models have made\nsignificant advancements in computer vision and natural language processing, they\nremain far from ready to effectively support real-world medical\napplications and clinical decision-making.\n All images used in this report can be found at\nhttps://github.com/chaoyi-wu/GPT-4V_Medical_Evaluation.\n","authors":["Chaoyi Wu","Jiayu Lei","Qiaoyu Zheng","Weike Zhao","Weixiong Lin","Xiaoman Zhang","Xiao Zhou","Ziheng Zhao","Ya Zhang","Yanfeng Wang","Weidi Xie"],"pdf_url":"https://arxiv.org/pdf/2310.09909v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10070v2","updated":"2023-10-17T03:29:04Z","published":"2023-06-15T20:19:08Z","title":"Opportunities and Challenges for ChatGPT and Large Language Models in\n Biomedicine and Health","summary":" ChatGPT has drawn considerable attention from both the general public and\ndomain experts with its remarkable text generation capabilities. This has\nsubsequently led to the emergence of diverse applications in the field of\nbiomedicine and health. In this work, we examine the diverse applications of\nlarge language models (LLMs), such as ChatGPT, in biomedicine and health.\nSpecifically, we explore the areas of biomedical information retrieval, question\nanswering, medical text summarization, information extraction, and medical\neducation, and investigate whether LLMs possess the transformative power to\nrevolutionize these tasks or whether the distinct complexities of the biomedical\ndomain present unique challenges. Following an extensive literature survey, we\nfind that significant advances have been made in the field of text generation\ntasks, surpassing the previous state-of-the-art methods. For other\napplications, the advances have been modest. Overall, LLMs have not yet\nrevolutionized biomedicine, but recent rapid progress indicates that such\nmethods hold great potential to provide valuable means for accelerating\ndiscovery and improving health. We also find that the use of LLMs, like\nChatGPT, in the fields of biomedicine and health entails various risks and\nchallenges, including fabricated information in its generated responses, as\nwell as legal and privacy concerns associated with sensitive patient data. 
We\nbelieve this survey can provide a comprehensive and timely overview to\nbiomedical researchers and healthcare practitioners on the opportunities and\nchallenges associated with using ChatGPT and other LLMs for transforming\nbiomedicine and health.\n","authors":["Shubo Tian","Qiao Jin","Lana Yeganova","Po-Ting Lai","Qingqing Zhu","Xiuying Chen","Yifan Yang","Qingyu Chen","Won Kim","Donald C. Comeau","Rezarta Islamaj","Aadit Kapoor","Xin Gao","Zhiyong Lu"],"pdf_url":"https://arxiv.org/pdf/2306.10070v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10967v1","updated":"2023-10-17T03:28:29Z","published":"2023-10-17T03:28:29Z","title":"EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset","summary":" The need for high-quality data has been a key issue hindering the research of\ndialogue tasks. Recent studies try to build datasets through manual, web\ncrawling, and large pre-trained models. However, man-made data is expensive and\ndata collected from the internet often includes generic responses, meaningless\nstatements, and toxic dialogues. Automatic data generation through large models\nis a cost-effective method, but for open-domain multimodal dialogue tasks,\nthere are still three drawbacks: 1) There is currently no open-source large\nmodel that can accept multimodal input; 2) The content generated by the model\nlacks interpretability; 3) The generated data is usually difficult to quality\ncontrol and require extensive resource to collect. To alleviate the significant\nhuman and resource expenditure in data collection, we propose a Multimodal Data\nConstruction Framework (MDCF). MDCF designs proper prompts to spur the\nlarge-scale pre-trained language model to generate well-formed and satisfactory\ncontent. Additionally, MDCF also automatically provides explanation for a given\nimage and its corresponding dialogue, which can provide a certain degree of\ninterpretability and facilitate manual follow-up quality inspection. Based on\nthis, we release an Explanatory Multimodal Open-Domain dialogue dataset\n(EXMODD). Experiments indicate a positive correlation between the model's\nability to generate accurate understandings and high-quality responses. Our\ncode and data can be found at https://github.com/poplpr/EXMODD.\n","authors":["Hang Yin","Pinren Lu","Ziang Li","Bin Sun","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2310.10967v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10962v1","updated":"2023-10-17T03:21:43Z","published":"2023-10-17T03:21:43Z","title":"Semantic-Aware Contrastive Sentence Representation Learning with Large\n Language Models","summary":" Contrastive learning has been proven to be effective in learning better\nsentence representations. However, to train a contrastive learning model, large\nnumbers of labeled sentences are required to construct positive and negative\npairs explicitly, such as those in natural language inference (NLI) datasets.\nUnfortunately, acquiring sufficient high-quality labeled data can be both\ntime-consuming and resource-intensive, leading researchers to focus on\ndeveloping methods for learning unsupervised sentence representations. As there\nis no clear relationship between these unstructured randomly-sampled sentences,\nbuilding positive and negative pairs over them is tricky and problematic. To\ntackle these challenges, in this paper, we propose SemCSR, a semantic-aware\ncontrastive sentence representation framework. 
By leveraging the generation and\nevaluation capabilities of large language models (LLMs), we can automatically\nconstruct a high-quality NLI-style corpus without any human annotation, and\nfurther incorporate the generated sentence pairs into learning a contrastive\nsentence representation model. Extensive experiments and comprehensive analyses\ndemonstrate the effectiveness of our proposed framework for learning a better\nsentence representation with LLMs.\n","authors":["Huiming Wang","Liying Cheng","Zhaodonghui Li","De Wen Soh","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2310.10962v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10956v1","updated":"2023-10-17T03:05:42Z","published":"2023-10-17T03:05:42Z","title":"Computing the optimal keyboard through a geometric analysis of the\n English language","summary":" In the context of a group project for the course COMSW4995 002 - Geometric\nData Analysis, we bring our attention to the design of fast-typing keyboards.\nLeveraging some geometric tools in an optimization framework allowed us to\npropose novel keyboard layouts that offer faster typing.\n","authors":["Jules Deschamps","Quentin Hubert","Lucas Ryckelynck"],"pdf_url":"https://arxiv.org/pdf/2310.10956v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2310.10955v1","updated":"2023-10-17T03:05:06Z","published":"2023-10-17T03:05:06Z","title":"A State-Vector Framework for Dataset Effects","summary":" The impressive success of recent deep neural network (DNN)-based systems is\nsignificantly influenced by the high-quality datasets used in training.\nHowever, the effects of the datasets, especially how they interact with each\nother, remain underexplored. We propose a state-vector framework to enable\nrigorous studies in this direction. This framework uses idealized probing test\nresults as the bases of a vector space. This framework allows us to quantify\nthe effects of both standalone and interacting datasets. We show that the\nsignificant effects of some commonly-used language understanding datasets are\ncharacteristic and are concentrated on a few linguistic dimensions.\nAdditionally, we observe some ``spill-over'' effects: the datasets could impact\nthe models along dimensions that may seem unrelated to the intended tasks. Our\nstate-vector framework paves the way for a systematic understanding of the\ndataset effects, a crucial component in responsible and robust model\ndevelopment.\n","authors":["Esmat Sahak","Zining Zhu","Frank Rudzicz"],"pdf_url":"https://arxiv.org/pdf/2310.10955v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.10944v1","updated":"2023-10-17T02:42:34Z","published":"2023-10-17T02:42:34Z","title":"TEQ: Trainable Equivalent Transformation for Quantization of LLMs","summary":" As large language models (LLMs) become more prevalent, there is a growing\nneed for new and improved quantization methods that can meet the\ncomputational demands of these modern architectures while maintaining\naccuracy. In this paper, we present TEQ, a trainable equivalent\ntransformation that preserves the FP32 precision of the model output while\ntaking advantage of low-precision quantization, especially 3- and 4-bit\nweight-only quantization. The training process is lightweight, requiring only\n1K steps and fewer than 0.1 percent of the original model's trainable\nparameters. Furthermore, the transformation does not add any computational\noverhead during inference. 
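The TEQ abstract above describes a trainable transformation that leaves the full-precision output of a linear layer unchanged while making its weights friendlier to low-bit quantization. One common form of such an equivalence is a per-input-channel scale applied to the weights with its inverse folded into the activations; the sketch below only illustrates that identity with fixed scales and a toy round-to-nearest quantizer, whereas TEQ learns the scales, so this is not the paper's implementation:

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Toy symmetric round-to-nearest weight quantizer (per-tensor)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))         # activations (batch, in_features)
W = rng.normal(size=(16, 32))        # weights of a linear layer

s = np.abs(W).max(axis=1)            # one positive scale per input channel
W_scaled = W / s[:, None]            # scale the weights ...
x_scaled = x * s[None, :]            # ... and fold the inverse into the input

# The transformation is an exact equivalence in full precision.
assert np.allclose(x_scaled @ W_scaled, x @ W)

# Quantization is then applied to the rescaled weights.
y_quantized = x_scaled @ fake_quantize(W_scaled)
```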
Our results are on-par with the state-of-the-art\n(SOTA) methods on typical LLMs. Our approach can be combined with other methods\nto achieve even better performance. The code is available at\nhttps://github.com/intel/neural-compressor.\n","authors":["Wenhua Cheng","Yiyang Cai","Kaokao Lv","Haihao Shen"],"pdf_url":"https://arxiv.org/pdf/2310.10944v1.pdf","comment":"10 pages, 3 figures"},{"id":"http://arxiv.org/abs/2304.01492v5","updated":"2023-10-17T02:37:46Z","published":"2023-04-04T03:13:03Z","title":"A Unified Contrastive Transfer Framework with Propagation Structure for\n Boosting Low-Resource Rumor Detection","summary":" The truth is significantly hampered by massive rumors that spread along with\nbreaking news or popular topics. Since there is sufficient corpus gathered from\nthe same domain for model training, existing rumor detection algorithms show\npromising performance on yesterday's news. However, due to a lack of\nsubstantial training data and prior expert knowledge, they are poor at spotting\nrumors concerning unforeseen events, especially those propagated in different\nlanguages (i.e., low-resource regimes). In this paper, we propose a unified\ncontrastive transfer framework to detect rumors by adapting the features\nlearned from well-resourced rumor data to that of the low-resourced with only\nfew-shot annotations. More specifically, we first represent rumor circulated on\nsocial media as an undirected topology for enhancing the interaction of user\nopinions, and then train a Multi-scale Graph Convolutional Network via a\nunified contrastive paradigm to mine effective clues simultaneously from post\nsemantics and propagation structure. Our model explicitly breaks the barriers\nof the domain and/or language issues, via language alignment and a novel\ndomain-adaptive contrastive learning mechanism. To well-generalize the\nrepresentation learning using a small set of annotated target events, we reveal\nthat rumor-indicative signal is closely correlated with the uniformity of the\ndistribution of these events. We design a target-wise contrastive training\nmechanism with three event-level data augmentation strategies, capable of\nunifying the representations by distinguishing target events. Extensive\nexperiments conducted on four low-resource datasets collected from real-world\nmicroblog platforms demonstrate that our framework achieves much better\nperformance than state-of-the-art methods and exhibits a superior capacity for\ndetecting rumors at early stages.\n","authors":["Hongzhan Lin","Jing Ma","Ruichao Yang","Zhiwei Yang","Mingfei Cheng"],"pdf_url":"https://arxiv.org/pdf/2304.01492v5.pdf","comment":"An extension of the first contrastive approach for low-resource rumor\n detection (arXiv:2204.08143)"},{"id":"http://arxiv.org/abs/2310.10941v1","updated":"2023-10-17T02:34:34Z","published":"2023-10-17T02:34:34Z","title":"MASON-NLP at eRisk 2023: Deep Learning-Based Detection of Depression\n Symptoms from Social Media Texts","summary":" Depression is a mental health disorder that has a profound impact on people's\nlives. Recent research suggests that signs of depression can be detected in the\nway individuals communicate, both through spoken words and written texts. In\nparticular, social media posts are a rich and convenient text source that we\nmay examine for depressive symptoms. The Beck Depression Inventory (BDI)\nQuestionnaire, which is frequently used to gauge the severity of depression, is\none instrument that can aid in this study. 
We can narrow our study to only\nthose symptoms since each BDI question is linked to a particular depressive\nsymptom. It's important to remember that not everyone with depression exhibits\nall symptoms at once, but rather a combination of them. Therefore, it is\nextremely useful to be able to determine if a sentence or a piece of\nuser-generated content is pertinent to a certain condition. With this in mind,\nthe eRisk 2023 Task 1 was designed to do exactly that: assess the relevance of\ndifferent sentences to the symptoms of depression as outlined in the BDI\nquestionnaire. This report is all about how our team, Mason-NLP, participated\nin this subtask, which involved identifying sentences related to different\ndepression symptoms. We used a deep learning approach that incorporated\nMentalBERT, RoBERTa, and LSTM. Despite our efforts, the evaluation results were\nlower than expected, underscoring the challenges inherent in ranking sentences\nfrom an extensive dataset about depression, which necessitates both appropriate\nmethodological choices and significant computational resources. We anticipate\nthat future iterations of this shared task will yield improved results as our\nunderstanding and techniques evolve.\n","authors":["Fardin Ahsan Sakib","Ahnaf Atef Choudhury","Ozlem Uzuner"],"pdf_url":"https://arxiv.org/pdf/2310.10941v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10935v1","updated":"2023-10-17T02:12:12Z","published":"2023-10-17T02:12:12Z","title":"Intent Detection and Slot Filling for Home Assistants: Dataset and\n Analysis for Bangla and Sylheti","summary":" As voice assistants cement their place in our technologically advanced\nsociety, there remains a need to cater to the diverse linguistic landscape,\nincluding colloquial forms of low-resource languages. Our study introduces the\nfirst-ever comprehensive dataset for intent detection and slot filling in\nformal Bangla, colloquial Bangla, and Sylheti languages, totaling 984 samples\nacross 10 unique intents. Our analysis reveals the robustness of large language\nmodels for tackling downstream tasks with inadequate data. The GPT-3.5 model\nachieves an impressive F1 score of 0.94 in intent detection and 0.51 in slot\nfilling for colloquial Bangla.\n","authors":["Fardin Ahsan Sakib","A H M Rezaul Karim","Saadat Hasan Khan","Md Mushfiqur Rahman"],"pdf_url":"https://arxiv.org/pdf/2310.10935v1.pdf","comment":"Accepted at the First Workshop on Bangla Language Processing, 2023"},{"id":"http://arxiv.org/abs/2310.09430v2","updated":"2023-10-17T02:08:24Z","published":"2023-10-13T22:29:15Z","title":"A Systematic Evaluation of Large Language Models on Out-of-Distribution\n Logical Reasoning Tasks","summary":" Large language models (LLMs), such as GPT-3.5 and GPT-4, have greatly\nadvanced the performance of artificial systems on various natural language\nprocessing tasks to human-like levels. However, their generalisation and\nrobustness to perform logical reasoning remain under-evaluated. To probe this\nability, we propose three new logical reasoning datasets named \"ReClor-plus\",\n\"LogiQA-plus\" and \"LogiQAv2-plus\", each featuring three subsets: the first with\nrandomly shuffled options, the second with the correct choices replaced by\n\"none of the other options are correct\", and a combination of the previous two\nsubsets. We carry out experiments on these datasets with both discriminative\nand generative LLMs and show that these simple tricks greatly hinder the\nperformance of the language models. 
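The dataset perturbations described in the logical-reasoning abstract above (randomly shuffling the answer options, and replacing the gold option with a "none of the other options are correct" choice) are straightforward to reproduce. A sketch under the assumption that each item is a dict with an 'options' list and an integer 'answer' index; the field names are illustrative, not the datasets' actual schema:

```python
import random

NONE_OPTION = "none of the other options are correct"

def shuffle_options(item, seed=0):
    """Return a copy of the item with its answer options randomly reordered."""
    rng = random.Random(seed)
    order = list(range(len(item["options"])))
    rng.shuffle(order)
    return {"options": [item["options"][i] for i in order],
            "answer": order.index(item["answer"])}

def replace_gold_with_none(item):
    """Replace the text of the correct option with the 'none of the above'-style choice."""
    options = list(item["options"])
    options[item["answer"]] = NONE_OPTION
    return {"options": options, "answer": item["answer"]}

item = {"options": ["A", "B", "C", "D"], "answer": 2}
print(shuffle_options(item))
print(replace_gold_with_none(item))
```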
Despite their superior performance on the\noriginal publicly available datasets, we find that all models struggle to\nanswer our newly constructed datasets. We show that introducing task variations\nby perturbing a sizable training set can markedly improve the model's\ngeneralisation and robustness in logical reasoning tasks. Moreover, applying\nlogic-driven data augmentation for fine-tuning, combined with prompting can\nenhance the generalisation performance of both discriminative large language\nmodels and generative large language models. These results offer insights into\nassessing and improving the generalisation and robustness of large language\nmodels for logical reasoning tasks. We make our source code and data publicly\navailable\n\\url{https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning}.\n","authors":["Qiming Bao","Gael Gendron","Alex Yuxuan Peng","Wanjun Zhong","Neset Tan","Yang Chen","Michael Witbrock","Jiamou Liu"],"pdf_url":"https://arxiv.org/pdf/2310.09430v2.pdf","comment":"Accepted for oral presentation at the LLM@IJCAI 2023 non-archival\n symposium"},{"id":"http://arxiv.org/abs/2310.10930v1","updated":"2023-10-17T01:59:07Z","published":"2023-10-17T01:59:07Z","title":"Enhanced Transformer Architecture for Natural Language Processing","summary":" Transformer is a state-of-the-art model in the field of natural language\nprocessing (NLP). Current NLP models primarily increase the number of\ntransformers to improve processing performance. However, this technique\nrequires a lot of training resources such as computing capacity. In this paper,\na novel structure of Transformer is proposed. It is featured by full layer\nnormalization, weighted residual connection, positional encoding exploiting\nreinforcement learning, and zero masked self-attention. The proposed\nTransformer model, which is called Enhanced Transformer, is validated by the\nbilingual evaluation understudy (BLEU) score obtained with the Multi30k\ntranslation dataset. As a result, the Enhanced Transformer achieves 202.96%\nhigher BLEU score as compared to the original transformer with the translation\ndataset.\n","authors":["Woohyeon Moon","Taeyoung Kim","Bumgeun Park","Dongsoo Har"],"pdf_url":"https://arxiv.org/pdf/2310.10930v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2310.08659v2","updated":"2023-10-17T01:35:10Z","published":"2023-10-12T18:34:08Z","title":"LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models","summary":" Quantization is an indispensable technique for serving Large Language Models\n(LLMs) and has recently found its way into LoRA fine-tuning. In this work we\nfocus on the scenario where quantization and LoRA fine-tuning are applied\ntogether on a pre-trained model. In such cases it is common to observe a\nconsistent gap in the performance on downstream tasks between full fine-tuning\nand quantization plus LoRA fine-tuning approach. In response, we propose LoftQ\n(LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that\nsimultaneously quantizes an LLM and finds a proper low-rank initialization for\nLoRA fine-tuning. Such an initialization alleviates the discrepancy between the\nquantized and full-precision model and significantly improves the\ngeneralization in downstream tasks. We evaluate our method on natural language\nunderstanding, question answering, summarization, and natural language\ngeneration tasks. 
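One common way to realize the kind of quantization-aware low-rank initialization described in the LoftQ abstract above is to alternate between quantizing the weight matrix and fitting a rank-r factorization to the quantization residual via SVD. The sketch below follows that generic recipe with a toy round-to-nearest quantizer; it is a plausible reading of the idea, not necessarily the authors' exact algorithm:

```python
import numpy as np

def fake_quantize(w, bits=2):
    """Toy symmetric round-to-nearest quantizer standing in for real low-bit schemes."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def quantization_aware_lowrank_init(W, rank=16, bits=2, iters=5):
    """Alternate quantizing W - A @ B.T and refitting (A, B) to the residual W - Q."""
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((W.shape[1], rank))
    for _ in range(iters):
        Q = fake_quantize(W - A @ B.T, bits=bits)     # quantize what low-rank misses
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                    # best rank-r fit to the residual
        B = Vt[:rank].T
    return Q, A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
Q, A, B = quantization_aware_lowrank_init(W)
print("relative init error:", np.linalg.norm(W - (Q + A @ B.T)) / np.linalg.norm(W))
```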
Experiments show that our method is highly effective and\noutperforms existing quantization methods, especially in the challenging 2-bit\nand 2/4-bit mixed precision regimes. We will release our code.\n","authors":["Yixiao Li","Yifan Yu","Chen Liang","Pengcheng He","Nikos Karampatziakis","Weizhu Chen","Tuo Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.08659v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10922v1","updated":"2023-10-17T01:31:59Z","published":"2023-10-17T01:31:59Z","title":"Spatial HuBERT: Self-supervised Spatial Speech Representation Learning\n for a Single Talker from Multi-channel Audio","summary":" Self-supervised learning has been used to leverage unlabelled data, improving\naccuracy and generalisation of speech systems through the training of\nrepresentation models. While many recent works have sought to produce effective\nrepresentations across a variety of acoustic domains, languages, modalities and\neven simultaneous speakers, these studies have all been limited to\nsingle-channel audio recordings. This paper presents Spatial HuBERT, a\nself-supervised speech representation model that learns both acoustic and\nspatial information pertaining to a single speaker in a potentially noisy\nenvironment by using multi-channel audio inputs. Spatial HuBERT learns\nrepresentations that outperform state-of-the-art single-channel speech\nrepresentations on a variety of spatial downstream tasks, particularly in\nreverberant and noisy environments. We also demonstrate the utility of the\nrepresentations learned by Spatial HuBERT on a speech localisation downstream\ntask. Along with this paper, we publicly release a new dataset of 100 000\nsimulated first-order ambisonics room impulse responses.\n","authors":["Antoni Dimitriadis","Siqi Pan","Vidhyasaharan Sethu","Beena Ahmed"],"pdf_url":"https://arxiv.org/pdf/2310.10922v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.04965v2","updated":"2023-10-17T01:30:57Z","published":"2023-09-10T08:55:24Z","title":"Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image\n Captioning","summary":" While impressive performance has been achieved in image captioning, the\nlimited diversity of the generated captions and the large parameter scale\nremain major barriers to the real-word application of these systems. In this\nwork, we propose a lightweight image captioning network in combination with\ncontinuous diffusion, called Prefix-diffusion. To achieve diversity, we design\nan efficient method that injects prefix image embeddings into the denoising\nprocess of the diffusion model. In order to reduce trainable parameters, we\nemploy a pre-trained model to extract image features and further design an\nextra mapping network. Prefix-diffusion is able to generate diverse captions\nwith relatively less parameters, while maintaining the fluency and relevance of\nthe captions benefiting from the generative capabilities of the diffusion\nmodel. 
Our work paves the way for scaling up diffusion models for image\ncaptioning, and achieves promising performance compared with recent approaches.\n","authors":["Guisheng Liu","Yi Li","Zhengcong Fei","Haiyan Fu","Xiangyang Luo","Yanqing Guo"],"pdf_url":"https://arxiv.org/pdf/2309.04965v2.pdf","comment":"11 pages,4 figures, 6 tables"},{"id":"http://arxiv.org/abs/2310.10920v1","updated":"2023-10-17T01:27:20Z","published":"2023-10-17T01:27:20Z","title":"NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear\n Domain","summary":" As LLMs have become increasingly popular, they have been used in almost every\nfield. But as the application for LLMs expands from generic fields to narrow,\nfocused science domains, there exists an ever-increasing gap in ways to\nevaluate their efficacy in those fields. For the benchmarks that do exist, a\nlot of them focus on questions that don't require proper understanding of the\nsubject in question. In this paper, we present NuclearQA, a human-made\nbenchmark of 100 questions to evaluate language models in the nuclear domain,\nconsisting of a varying collection of questions that have been specifically\ndesigned by experts to test the abilities of language models. We detail our\napproach and show how the mix of several types of questions makes our benchmark\nuniquely capable of evaluating models in the nuclear domain. We also present\nour own evaluation metric for assessing LLM's performances due to the\nlimitations of existing ones. Our experiments on state-of-the-art models\nsuggest that even the best LLMs perform less than satisfactorily on our\nbenchmark, demonstrating the scientific knowledge gap of existing LLMs.\n","authors":["Anurag Acharya","Sai Munikoti","Aaron Hellinger","Sara Smith","Sridevi Wagle","Sameera Horawalavithana"],"pdf_url":"https://arxiv.org/pdf/2310.10920v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2310.10903v1","updated":"2023-10-17T00:22:10Z","published":"2023-10-17T00:22:10Z","title":"Emergent AI-Assisted Discourse: Case Study of a Second Language Writer\n Authoring with ChatGPT","summary":" The rapid proliferation of ChatGPT has incited debates regarding its impact\non human writing. Amid concerns about declining writing standards, this study\ninvestigates the role of ChatGPT in facilitating academic writing, especially\namong language learners. Using a case study approach, this study examines the\nexperiences of Kailing, a doctoral student, who integrates ChatGPT throughout\ntheir academic writing process. The study employs activity theory as a lens for\nunderstanding writing with generative AI tools and data analyzed includes\nsemi-structured interviews, writing samples, and GPT logs. Results indicate\nthat Kailing effectively collaborates with ChatGPT across various writing\nstages while preserving her distinct authorial voice and agency. 
This\nunderscores the potential of AI tools such as ChatGPT to enhance academic\nwriting for language learners without overshadowing individual authenticity.\nThis case study offers a critical exploration of how ChatGPT is utilized in the\nacademic writing process and the preservation of a student's authentic voice\nwhen engaging with the tool.\n","authors":["Sharin Jacob","Tamara Tate","Mark Warschauer"],"pdf_url":"https://arxiv.org/pdf/2310.10903v1.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2310.11628v1","updated":"2023-10-17T23:34:39Z","published":"2023-10-17T23:34:39Z","title":"Learn Your Tokens: Word-Pooled Tokenization for Language Modeling","summary":" Language models typically tokenize text into subwords, using a deterministic,\nhand-engineered heuristic of combining characters into longer surface-level\nstrings such as 'ing' or whole words. Recent literature has repeatedly shown\nthe limitations of such a tokenization strategy, particularly for documents not\nwritten in English and for representing numbers. On the other extreme,\nbyte/character-level language models are much less restricted but suffer from\nincreased sequence description lengths and a subsequent quadratic expansion in\nself-attention computation. Recent attempts to compress and limit these context\nlengths with fixed-size convolutions are helpful but completely ignore the word\nboundary. This paper considers an alternative 'learn your tokens' scheme which\nutilizes the word boundary to pool bytes/characters into word representations,\nwhich are fed to the primary language model, before again decoding individual\ncharacters/bytes per word in parallel. We find that our moderately expressive\nand moderately fast end-to-end tokenizer outperforms by over 300% both subwords\nand byte/character models over the intrinsic language modeling metric of\nnext-word prediction across datasets. It particularly shines on rare words,\noutperforming by a factor of 30! We extensively study the language modeling\nsetup for all three categories of tokenizers and theoretically analyze how our\nend-to-end models can also be a strong trade-off in efficiency and robustness.\n","authors":["Avijit Thawani","Saurabh Ghanekar","Xiaoyuan Zhu","Jay Pujara"],"pdf_url":"https://arxiv.org/pdf/2310.11628v1.pdf","comment":"Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.05295v2","updated":"2023-10-17T22:43:08Z","published":"2023-10-08T21:45:34Z","title":"Visual Storytelling with Question-Answer Plans","summary":" Visual storytelling aims to generate compelling narratives from image\nsequences. Existing models often focus on enhancing the representation of the\nimage sequence, e.g., with external knowledge sources or advanced graph\nstructures. Despite recent progress, the stories are often repetitive,\nillogical, and lacking in detail. To mitigate these issues, we present a novel\nframework which integrates visual representations with pretrained language\nmodels and planning. Our model translates the image sequence into a visual\nprefix, a sequence of continuous embeddings which language models can\ninterpret. It also leverages a sequence of question-answer pairs as a blueprint\nplan for selecting salient visual concepts and determining how they should be\nassembled into a narrative. 
Automatic and human evaluation on the VIST\nbenchmark (Huang et al., 2016) demonstrates that blueprint-based models\ngenerate stories that are more coherent, interesting, and natural compared to\ncompetitive baselines and state-of-the-art systems.\n","authors":["Danyang Liu","Mirella Lapata","Frank Keller"],"pdf_url":"https://arxiv.org/pdf/2310.05295v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.11616v1","updated":"2023-10-17T22:42:12Z","published":"2023-10-17T22:42:12Z","title":"Unveiling the General Intelligence Factor in Language Models: A\n Psychometric Approach","summary":" This study uncovers the factor of general intelligence, or g, in language\nmodels, extending the psychometric theory traditionally applied to humans and\ncertain animal species. Utilizing factor analysis on two extensive datasets -\nOpen LLM Leaderboard with 1,232 models and General Language Understanding\nEvaluation (GLUE) Leaderboard with 88 models - we find compelling evidence for\na unidimensional, highly stable g factor that accounts for 85% of the variance\nin model performance. The study also finds a moderate correlation of .48\nbetween model size and g. The discovery of g in language models offers a\nunified metric for model evaluation and opens new avenues for more robust,\ng-based model ability assessment. These findings lay the foundation for\nunderstanding and future research on artificial general intelligence from a\npsychometric perspective and have practical implications for model evaluation\nand development.\n","authors":["David Ilić"],"pdf_url":"https://arxiv.org/pdf/2310.11616v1.pdf","comment":"10 pages (including appendix), 7 figures"},{"id":"http://arxiv.org/abs/2310.09166v2","updated":"2023-10-17T22:37:58Z","published":"2023-10-13T15:01:17Z","title":"Developing a Natural Language Understanding Model to Characterize Cable\n News Bias","summary":" Media bias has been extensively studied by both social and computational\nsciences. However, current work still has a large reliance on human input and\nsubjective assessment to label biases. This is especially true for cable news\nresearch. To address these issues, we develop an unsupervised machine learning\nmethod to characterize the bias of cable news programs without any human input.\nThis method relies on the analysis of what topics are mentioned through Named\nEntity Recognition and how those topics are discussed through Stance Analysis\nin order to cluster programs with similar biases together. Applying our method\nto 2020 cable news transcripts, we find that program clusters are consistent\nover time and roughly correspond to the cable news network of the program. This\nmethod reveals the potential for future tools to objectively assess media bias\nand characterize unfamiliar media environments.\n","authors":["Seth P. Benson","Iain J. Cruickshank"],"pdf_url":"https://arxiv.org/pdf/2310.09166v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.10764v2","updated":"2023-10-17T22:15:38Z","published":"2022-12-21T04:49:55Z","title":"Learning List-Level Domain-Invariant Representations for Ranking","summary":" Domain adaptation aims to transfer the knowledge learned on (data-rich)\nsource domains to (low-resource) target domains, and a popular method is\ninvariant representation learning, which matches and aligns the data\ndistributions on the feature space. 
Although this method is studied extensively\nand applied on classification and regression problems, its adoption on ranking\nproblems is sporadic, and the few existing implementations lack theoretical\njustifications. This paper revisits invariant representation learning for\nranking. Upon reviewing prior work, we found that they implement what we call\nitem-level alignment, which aligns the distributions of the items being ranked\nfrom all lists in aggregate but ignores their list structure. However, the list\nstructure should be leveraged, because it is intrinsic to ranking problems\nwhere the data and the metrics are defined and computed on lists, not the items\nby themselves. To close this discrepancy, we propose list-level alignment --\nlearning domain-invariant representations at the higher level of lists. The\nbenefits are twofold: it leads to the first domain adaptation generalization\nbound for ranking, in turn providing theoretical support for the proposed\nmethod, and it achieves better empirical transfer performance for unsupervised\ndomain adaptation on ranking tasks, including passage reranking.\n","authors":["Ruicheng Xian","Honglei Zhuang","Zhen Qin","Hamed Zamani","Jing Lu","Ji Ma","Kai Hui","Han Zhao","Xuanhui Wang","Michael Bendersky"],"pdf_url":"https://arxiv.org/pdf/2212.10764v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.09473v2","updated":"2023-10-17T22:01:00Z","published":"2023-02-19T04:03:22Z","title":"Video-Text Retrieval by Supervised Sparse Multi-Grained Learning","summary":" While recent progress in video-text retrieval has been advanced by the\nexploration of better representation learning, in this paper, we present a\nnovel multi-grained sparse learning framework, S3MA, to learn an aligned sparse\nspace shared between the video and the text for video-text retrieval. The\nshared sparse space is initialized with a finite number of sparse concepts,\neach of which refers to a number of words. With the text data at hand, we learn\nand update the shared sparse space in a supervised manner using the proposed\nsimilarity and alignment losses. Moreover, to enable multi-grained alignment,\nwe incorporate frame representations for better modeling the video modality and\ncalculating fine-grained and coarse-grained similarities. Benefiting from the\nlearned shared sparse space and multi-grained similarities, extensive\nexperiments on several video-text retrieval benchmarks demonstrate the\nsuperiority of S3MA over existing methods. Our code is available at\nhttps://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.\n","authors":["Yimu Wang","Peng Shi"],"pdf_url":"https://arxiv.org/pdf/2302.09473v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11604v1","updated":"2023-10-17T21:57:36Z","published":"2023-10-17T21:57:36Z","title":"Language Models as Zero-Shot Trajectory Generators","summary":" Large Language Models (LLMs) have recently shown promise as high-level\nplanners for robots when given access to a selection of low-level skills.\nHowever, it is often assumed that LLMs do not possess sufficient knowledge to\nbe used for the low-level trajectories themselves. In this work, we address\nthis assumption thoroughly, and investigate if an LLM (GPT-4) can directly\npredict a dense sequence of end-effector poses for manipulation skills, when\ngiven access to only object detection and segmentation vision models. 
We study\nhow well a single task-agnostic prompt, without any in-context examples, motion\nprimitives, or external trajectory optimisers, can perform across 26 real-world\nlanguage-based tasks, such as \"open the bottle cap\" and \"wipe the plate with\nthe sponge\", and we investigate which design choices in this prompt are the\nmost effective. Our conclusions raise the assumed limit of LLMs for robotics,\nand we reveal for the first time that LLMs do indeed possess an understanding\nof low-level robot control sufficient for a range of common tasks, and that\nthey can additionally detect failures and then re-plan trajectories\naccordingly. Videos, code, and prompts are available at:\nhttps://www.robot-learning.uk/language-models-trajectory-generators.\n","authors":["Teyun Kwon","Norman Di Palo","Edward Johns"],"pdf_url":"https://arxiv.org/pdf/2310.11604v1.pdf","comment":"19 pages, 21 figures"},{"id":"http://arxiv.org/abs/2310.11593v1","updated":"2023-10-17T21:35:06Z","published":"2023-10-17T21:35:06Z","title":"Automated Evaluation of Personalized Text Generation using Large\n Language Models","summary":" Personalized text generation presents a specialized mechanism for delivering\ncontent that is specific to a user's personal context. While the research\nprogress in this area has been rapid, evaluation still presents a challenge.\nTraditional automated metrics such as BLEU and ROUGE primarily measure lexical\nsimilarity to human-written references, and are not able to distinguish\npersonalization from other subtle semantic aspects, thus falling short of\ncapturing the nuances of personalized generated content quality. On the other\nhand, human judgments are costly to obtain, especially in the realm of\npersonalized evaluation. Inspired by these challenges, we explore the use of\nlarge language models (LLMs) for evaluating personalized text generation, and\nexamine their ability to understand nuanced user context. We present AuPEL, a\nnovel evaluation method that distills three major semantic aspects of the\ngenerated text: personalization, quality and relevance, and automatically\nmeasures these aspects. To validate the effectiveness of AuPEL, we design\ncarefully controlled experiments and compare the accuracy of the evaluation\njudgments made by LLMs versus that of judgements made by human annotators, and\nconduct rigorous analyses of the consistency and sensitivity of the proposed\nmetric. We find that, compared to existing evaluation metrics, AuPEL not only\ndistinguishes and ranks models based on their personalization abilities more\naccurately, but also presents commendable consistency and efficiency for this\ntask. Our work suggests that using LLMs as the evaluators of personalized text\ngeneration is superior to traditional text similarity metrics, even though\ninteresting new challenges still remain.\n","authors":["Yaqing Wang","Jiepu Jiang","Mingyang Zhang","Cheng Li","Yi Liang","Qiaozhu Mei","Michael Bendersky"],"pdf_url":"https://arxiv.org/pdf/2310.11593v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11589v1","updated":"2023-10-17T21:11:21Z","published":"2023-10-17T21:11:21Z","title":"Eliciting Human Preferences with Language Models","summary":" Language models (LMs) can be directed to perform target tasks by using\nlabeled examples or natural language prompts. 
But selecting examples or writing\nprompts can be challenging--especially in tasks that involve unusual edge\ncases, demand precise articulation of nebulous preferences, or require an\naccurate mental model of LM behavior. We propose to use *LMs themselves* to\nguide the task specification process. In this paper, we introduce **Generative\nActive Task Elicitation (GATE)**: a learning framework in which models elicit\nand infer intended behavior through free-form, language-based interaction with\nusers. We study GATE in three domains: email validation, content\nrecommendation, and moral reasoning. In preregistered experiments, we show that\nLMs prompted to perform GATE (e.g., by generating open-ended questions or\nsynthesizing informative edge cases) elicit responses that are often more\ninformative than user-written prompts or labels. Users report that interactive\ntask elicitation requires less effort than prompting or example labeling and\nsurfaces novel considerations not initially anticipated by users. Our findings\nsuggest that LM-driven elicitation can be a powerful tool for aligning models\nto complex human preferences and values.\n","authors":["Belinda Z. Li","Alex Tamkin","Noah Goodman","Jacob Andreas"],"pdf_url":"https://arxiv.org/pdf/2310.11589v1.pdf","comment":"26 pages, 15 figures"},{"id":"http://arxiv.org/abs/2310.11584v1","updated":"2023-10-17T21:05:20Z","published":"2023-10-17T21:05:20Z","title":"BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment\n in Central Philippine Languages","summary":" Current research on automatic readability assessment (ARA) has focused on\nimproving the performance of models in high-resource languages such as English.\nIn this work, we introduce and release BasahaCorpus as part of an initiative\naimed at expanding available corpora and baseline models for readability\nassessment in lower-resource languages in the Philippines. We compiled a corpus\nof short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and\nRinconada -- languages belonging to the Central Philippine family tree subgroup\n-- to train ARA models using surface-level, syllable-pattern, and n-gram\noverlap features. We also propose a new hierarchical cross-lingual modeling\napproach that takes advantage of a language's placement in the family tree to\nincrease the amount of available training data. Our study yields encouraging\nresults that support previous work showcasing the efficacy of cross-lingual\nmodels in low-resource settings, as well as similarities in highly informative\nlinguistic features for mutually intelligible languages.\n","authors":["Joseph Marvin Imperial","Ekaterina Kochmar"],"pdf_url":"https://arxiv.org/pdf/2310.11584v1.pdf","comment":"Final camera-ready paper for EMNLP 2023 (Main)"},{"id":"http://arxiv.org/abs/2305.14577v2","updated":"2023-10-17T21:03:10Z","published":"2023-05-23T23:31:02Z","title":"Difference-Masking: Choosing What to Mask in Continued Pretraining","summary":" The self-supervised objective of masking-and-predicting has led to promising\nperformance gains on a variety of downstream tasks. However, while most\napproaches randomly mask tokens, there is strong intuition that deciding what\nto mask can substantially improve learning outcomes. We investigate this in the\ncontinued pretraining setting in which pretrained models continue to pretrain\non domain-specific data before performing some downstream task. 
We introduce\nDifference-Masking, a masking strategy that automatically chooses what to mask\nduring continued pretraining by considering what makes a task domain different\nfrom the pretraining domain. Empirically, we find that Difference-Masking\noutperforms baselines on continued pretraining settings across four diverse\nlanguage-only and multimodal video tasks.\n","authors":["Alex Wilf","Syeda Nahida Akter","Leena Mathur","Paul Pu Liang","Sheryl Mathew","Mengrou Shou","Eric Nyberg","Louis-Philippe Morency"],"pdf_url":"https://arxiv.org/pdf/2305.14577v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11571v1","updated":"2023-10-17T20:40:59Z","published":"2023-10-17T20:40:59Z","title":"What is a good question? Task-oriented asking with fact-level masking","summary":" Asking questions is an important element of real-life collaboration on\nreasoning tasks like question answering. For example, a legal assistant chatbot\nmay be unable to make accurate recommendations without specific information on\nthe user's circumstances. However, large language models are usually deployed\nto solve reasoning tasks directly without asking follow-up questions to the\nuser or third parties. We term this problem task-oriented asking (TOA).\nZero-shot chat models can perform TOA, but their training is primarily based on\nnext-token prediction rather than whether questions contribute to successful\ncollaboration. To enable the training and evaluation of TOA models, we present\na definition and framework for natural language task-oriented asking, the\nproblem of generating questions that result in answers useful for a reasoning\ntask. We also present fact-level masking (FLM), a procedure for converting\nnatural language datasets into self-supervised TOA datasets by omitting\nparticular critical facts. Finally, we generate a TOA dataset from the HotpotQA\ndataset using FLM and evaluate several zero-shot language models on it. Our\nexperiments show that current zero-shot models struggle to ask questions that\nretrieve useful information, as compared to human annotators. These results\ndemonstrate an opportunity to use FLM datasets and the TOA framework to train\nand evaluate better TOA models.\n","authors":["Matthew Toles","Yukun Huang","Zhou Yu","Luis Gravano"],"pdf_url":"https://arxiv.org/pdf/2310.11571v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11564v1","updated":"2023-10-17T20:22:13Z","published":"2023-10-17T20:22:13Z","title":"Personalized Soups: Personalized Large Language Model Alignment via\n Post-hoc Parameter Merging","summary":" While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language\nModels (LLMs) with general, aggregate human preferences, it is suboptimal for\nlearning diverse, individual perspectives. In this work, we study Reinforcement\nLearning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are\naligned to multiple (sometimes conflicting) preferences by modeling alignment\nas a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong\nsingle-objective baselines, we show that we can achieve personalized alignment\nby decomposing preferences into multiple dimensions. 
These dimensions are\ndefined based on personalizations that are declared as desirable by the user.\nIn this work, we show that they can be efficiently trained independently in a\ndistributed manner and combined effectively post-hoc through parameter merging.\nThe code is available at https://github.com/joeljang/RLPHF.\n","authors":["Joel Jang","Seungone Kim","Bill Yuchen Lin","Yizhong Wang","Jack Hessel","Luke Zettlemoyer","Hannaneh Hajishirzi","Yejin Choi","Prithviraj Ammanabrolu"],"pdf_url":"https://arxiv.org/pdf/2310.11564v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2310.10378v2","updated":"2023-10-17T19:56:41Z","published":"2023-10-16T13:19:17Z","title":"Cross-Lingual Consistency of Factual Knowledge in Multilingual Language\n Models","summary":" Multilingual large-scale Pretrained Language Models (PLMs) have been shown to\nstore considerable amounts of factual knowledge, but large variations are\nobserved across languages. With the ultimate goal of ensuring that users with\ndifferent language backgrounds obtain consistent feedback from the same model,\nwe study the cross-lingual consistency (CLC) of factual knowledge in various\nmultilingual PLMs. To this end, we propose a Ranking-based Consistency (RankC)\nmetric to evaluate knowledge consistency across languages independently from\naccuracy. Using this metric, we conduct an in-depth analysis of the determining\nfactors for CLC, both at model level and at language-pair level. Among other\nresults, we find that increasing model size leads to higher factual probing\naccuracy in most languages, but does not improve cross-lingual consistency.\nFinally, we conduct a case study on CLC when new factual associations are\ninserted in the PLMs via model editing. Results on a small sample of facts\ninserted in English reveal a clear pattern whereby the new piece of knowledge\ntransfers only to languages with which English has a high RankC score.\n","authors":["Jirui Qi","Raquel Fernández","Arianna Bisazza"],"pdf_url":"https://arxiv.org/pdf/2310.10378v2.pdf","comment":"Accepted at EMNLP2023 main conference. All code and data are released\n at https://github.com/Betswish/Cross-Lingual-Consistency"},{"id":"http://arxiv.org/abs/2310.10449v2","updated":"2023-10-17T19:54:16Z","published":"2023-10-16T14:33:02Z","title":"Text Summarization Using Large Language Models: A Comparative Study of\n MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models","summary":" Text summarization is a critical Natural Language Processing (NLP) task with\napplications ranging from information retrieval to content generation.\nLeveraging Large Language Models (LLMs) has shown remarkable promise in\nenhancing summarization techniques. This paper embarks on an exploration of\ntext summarization with a diverse set of LLMs, including MPT-7b-instruct,\nfalcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models. The experiment\nwas performed with different hyperparameters and evaluated the generated\nsummaries using widely accepted metrics such as the Bilingual Evaluation\nUnderstudy (BLEU) Score, Recall-Oriented Understudy for Gisting Evaluation\n(ROUGE) Score, and Bidirectional Encoder Representations from Transformers\n(BERT) Score. According to the experiment, text-davinci-003 outperformed the\nothers. This investigation involved two distinct datasets: CNN Daily Mail and\nXSum. 
Its primary objective was to provide a comprehensive understanding of the\nperformance of Large Language Models (LLMs) when applied to different datasets.\nThe assessment of these models' effectiveness contributes valuable insights to\nresearchers and practitioners within the NLP domain. This work serves as a\nresource for those interested in harnessing the potential of LLMs for text\nsummarization and lays the foundation for the development of advanced\nGenerative AI applications aimed at addressing a wide spectrum of business\nchallenges.\n","authors":["Lochan Basyal","Mihir Sanghvi"],"pdf_url":"https://arxiv.org/pdf/2310.10449v2.pdf","comment":"4 pages, 2 tables"},{"id":"http://arxiv.org/abs/2310.11541v1","updated":"2023-10-17T19:27:23Z","published":"2023-10-17T19:27:23Z","title":"MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and\n Phonetic Domains for Speech Representation Learning","summary":" In this paper, we present a methodology for linguistic feature extraction,\nfocusing particularly on automatically syllabifying words in multiple\nlanguages, with a design to be compatible with a forced-alignment tool, the\nMontreal Forced Aligner (MFA). In both the textual and phonetic domains, our\nmethod focuses on the extraction of phonetic transcriptions from text, stress\nmarks, and a unified automatic syllabification (in text and phonetic domains).\nThe system was built with open-source components and resources. Through an\nablation study, we demonstrate the efficacy of our approach in automatically\nsyllabifying words from several languages (English, French and Spanish).\nAdditionally, we apply the technique to the transcriptions of the CMU ARCTIC\ndataset, generating valuable annotations available\nonline\\footnote{\\url{https://github.com/noetits/MUST_P-SRL}} that are ideal for\nspeech representation learning, speech unit discovery, and disentanglement of\nspeech factors in several speech-related fields.\n","authors":["Noé Tits"],"pdf_url":"https://arxiv.org/pdf/2310.11541v1.pdf","comment":"Accepted for publication at EMNLP 2023"},{"id":"http://arxiv.org/abs/2304.13007v3","updated":"2023-10-17T19:18:05Z","published":"2023-04-25T17:27:37Z","title":"Answering Questions by Meta-Reasoning over Multiple Chains of Thought","summary":" Modern systems for multi-hop question answering (QA) typically break\nquestions into a sequence of reasoning steps, termed chain-of-thought (CoT),\nbefore arriving at a final answer. Often, multiple chains are sampled and\naggregated through a voting mechanism over the final answers, but the\nintermediate steps themselves are discarded. While such approaches improve\nperformance, they do not consider the relations between intermediate steps\nacross chains and do not provide a unified explanation for the predicted\nanswer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts\nlarge language models to meta-reason over multiple chains of thought, rather\nthan aggregating their answers. MCR examines different reasoning chains, mixes\ninformation between them and selects the most relevant facts in generating an\nexplanation and predicting the answer. MCR outperforms strong baselines on 7\nmulti-hop QA datasets. 
Moreover, our analysis reveals that MCR explanations\nexhibit high quality, enabling humans to verify its answers.\n","authors":["Ori Yoran","Tomer Wolfson","Ben Bogin","Uri Katz","Daniel Deutch","Jonathan Berant"],"pdf_url":"https://arxiv.org/pdf/2304.13007v3.pdf","comment":"Accepted for publication in The 2023 Conference on Empirical Methods\n in Natural Language Processing (EMNLP 2023). Author's final version"},{"id":"http://arxiv.org/abs/2310.11532v1","updated":"2023-10-17T19:02:40Z","published":"2023-10-17T19:02:40Z","title":"Multi-stage Large Language Model Correction for Speech Recognition","summary":" In this paper, we investigate the usage of large language models (LLMs) to\nimprove the performance of competitive speech recognition systems. Different\nfrom traditional language models that focus on one single data domain, the rise\nof LLMs brings us the opportunity to push the limit of state-of-the-art ASR\nperformance, and at the same time to achieve higher robustness and generalize\neffectively across multiple domains. Motivated by this, we propose a novel\nmulti-stage approach to combine traditional language model re-scoring and LLM\nprompting. Specifically, the proposed method has two stages: the first stage\nuses a language model to re-score an N-best list of ASR hypotheses and run a\nconfidence check; The second stage uses prompts to a LLM to perform ASR error\ncorrection on less confident results from the first stage. Our experimental\nresults demonstrate the effectiveness of the proposed method by showing a 10% ~\n20% relative improvement in WER over a competitive ASR system -- across\nmultiple test domains.\n","authors":["Jie Pu","Thai-Son Nguyen","Sebastian Stüker"],"pdf_url":"https://arxiv.org/pdf/2310.11532v1.pdf","comment":"Submitted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2210.03350v3","updated":"2023-10-17T18:57:17Z","published":"2022-10-07T06:50:23Z","title":"Measuring and Narrowing the Compositionality Gap in Language Models","summary":" We investigate the ability of language models to perform compositional\nreasoning tasks where the overall solution depends on correctly composing the\nanswers to sub-problems. We measure how often models can correctly answer all\nsub-problems but not generate the overall solution, a ratio we call the\ncompositionality gap. We evaluate this ratio by asking multi-hop questions with\nanswers that require composing multiple facts unlikely to have been observed\ntogether during pretraining. In the GPT-3 family of models, as model size\nincreases we show that the single-hop question answering performance improves\nfaster than the multi-hop performance does, therefore the compositionality gap\ndoes not decrease. This surprising result suggests that while more powerful\nmodels memorize and recall more factual knowledge, they show no corresponding\nimprovement in their ability to perform this kind of compositional reasoning.\n We then demonstrate how elicitive prompting (such as chain of thought)\nnarrows the compositionality gap by reasoning explicitly. We present a new\nmethod, self-ask, that further improves on chain of thought. In our method, the\nmodel explicitly asks itself (and answers) follow-up questions before answering\nthe initial question. We finally show that self-ask's structured prompting lets\nus easily plug in a search engine to answer the follow-up questions, which\nadditionally improves accuracy.\n","authors":["Ofir Press","Muru Zhang","Sewon Min","Ludwig Schmidt","Noah A. 
Smith","Mike Lewis"],"pdf_url":"https://arxiv.org/pdf/2210.03350v3.pdf","comment":"To appear at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11523v1","updated":"2023-10-17T18:41:57Z","published":"2023-10-17T18:41:57Z","title":"Group Preference Optimization: Few-Shot Alignment of Large Language\n Models","summary":" Many applications of large language models (LLMs), ranging from chatbots to\ncreative writing, require nuanced subjective judgments that can differ\nsignificantly across different groups. Existing alignment algorithms can be\nexpensive to align for each group, requiring prohibitive amounts of\ngroup-specific preference data and computation for real-world use cases. We\nintroduce Group Preference Optimization (GPO), an alignment framework that\nsteers language models to preferences of individual groups in a few-shot\nmanner. In GPO, we augment the base LLM with an independent transformer module\ntrained to predict the preferences of a group for the LLM generations. For\nfew-shot learning, we parameterize this module as an in-context autoregressive\ntransformer and train it via meta-learning on several groups. We empirically\nvalidate the efficacy of GPO through rigorous evaluations using LLMs with\nvaried sizes on three human opinion adaptation tasks. These tasks involve\nadapting to the preferences of US demographic groups, global countries, and\nindividual users. Our results demonstrate that GPO not only aligns models more\naccurately but also requires fewer group-specific preferences, and less\ntraining and inference computing resources, outperforming existing strategies\nsuch as in-context steering and fine-tuning methods.\n","authors":["Siyan Zhao","John Dang","Aditya Grover"],"pdf_url":"https://arxiv.org/pdf/2310.11523v1.pdf","comment":"24 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.11520v1","updated":"2023-10-17T18:38:03Z","published":"2023-10-17T18:38:03Z","title":"Automatic News Summerization","summary":" Natural Language Processing is booming with its applications in the real\nworld, one of which is Text Summarization for large texts including news\narticles. This research paper provides an extensive comparative evaluation of\nextractive and abstractive approaches for news text summarization, with an\nemphasis on the ROUGE score analysis. The study employs the CNN-Daily Mail\ndataset, which consists of news articles and human-generated reference\nsummaries. The evaluation employs ROUGE scores to assess the efficacy and\nquality of generated summaries. After Evaluation, we integrate the\nbest-performing models on a web application to assess their real-world\ncapabilities and user experience.\n","authors":["Kavach Dheer","Arpit Dhankhar"],"pdf_url":"https://arxiv.org/pdf/2310.11520v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.11649v2","updated":"2023-10-17T18:21:27Z","published":"2023-02-22T20:56:40Z","title":"Grounding Complex Natural Language Commands for Temporal Tasks in Unseen\n Environments","summary":" Grounding navigational commands to linear temporal logic (LTL) leverages its\nunambiguous semantics for reasoning about long-horizon tasks and verifying the\nsatisfaction of temporal constraints. Existing approaches require training data\nfrom the specific environment and landmarks that will be used in natural\nlanguage to understand commands in those environments. 
We propose Lang2LTL, a\nmodular system and a software package that leverages large language models\n(LLMs) to ground temporal navigational commands to LTL specifications in\nenvironments without prior language data. We comprehensively evaluate Lang2LTL\nfor five well-defined generalization behaviors. Lang2LTL demonstrates the\nstate-of-the-art ability of a single model to ground navigational commands to\ndiverse temporal specifications in 21 city-scaled environments. Finally, we\ndemonstrate a physical robot using Lang2LTL can follow 52 semantically diverse\nnavigational commands in two indoor environments.\n","authors":["Jason Xinyu Liu","Ziyi Yang","Ifrah Idrees","Sam Liang","Benjamin Schornstein","Stefanie Tellex","Ankit Shah"],"pdf_url":"https://arxiv.org/pdf/2302.11649v2.pdf","comment":"Conference on Robot Learning 2023"},{"id":"http://arxiv.org/abs/2310.11511v1","updated":"2023-10-17T18:18:32Z","published":"2023-10-17T18:18:32Z","title":"Self-RAG: Learning to Retrieve, Generate, and Critique through\n Self-Reflection","summary":" Despite their remarkable capabilities, large language models (LLMs) often\nproduce responses containing factual inaccuracies due to their sole reliance on\nthe parametric knowledge they encapsulate. Retrieval-Augmented Generation\n(RAG), an ad hoc approach that augments LMs with retrieval of relevant\nknowledge, decreases such issues. However, indiscriminately retrieving and\nincorporating a fixed number of retrieved passages, regardless of whether\nretrieval is necessary, or passages are relevant, diminishes LM versatility or\ncan lead to unhelpful response generation. We introduce a new framework called\nSelf-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's\nquality and factuality through retrieval and self-reflection. Our framework\ntrains a single arbitrary LM that adaptively retrieves passages on-demand, and\ngenerates and reflects on retrieved passages and its own generations using\nspecial tokens, called reflection tokens. Generating reflection tokens makes\nthe LM controllable during the inference phase, enabling it to tailor its\nbehavior to diverse task requirements. Experiments show that Self-RAG (7B and\n13B parameters) significantly outperforms state-of-the-art LLMs and\nretrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG\noutperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA,\nreasoning and fact verification tasks, and it shows significant gains in\nimproving factuality and citation accuracy for long-form generations relative\nto these models.\n","authors":["Akari Asai","Zeqiu Wu","Yizhong Wang","Avirup Sil","Hannaneh Hajishirzi"],"pdf_url":"https://arxiv.org/pdf/2310.11511v1.pdf","comment":"30 pages, 2 figures, 12 tables"},{"id":"http://arxiv.org/abs/2310.07177v2","updated":"2023-10-17T18:02:19Z","published":"2023-10-11T04:03:42Z","title":"Online Speculative Decoding","summary":" Speculative decoding is a pivotal technique to accelerate the inference of\nlarge language models (LLMs) by employing a smaller draft model to predict the\ntarget model's outputs. However, its efficacy can be limited due to the low\npredictive accuracy of the draft model, particularly when faced with diverse\ntext inputs and a significant capability gap between the draft and target\nmodels. We introduce online speculative decoding (OSD) to address this\nchallenge. 
The main idea is to continually update (multiple) draft model(s) on\nobserved user query data using the abundant excess computational power in an\nLLM serving cluster. Given that LLM inference is memory-bounded, the surplus\ncomputational power in a typical LLM serving cluster can be repurposed for\nonline retraining of draft models, thereby making the training cost-neutral.\nSince the query distribution of an LLM service is relatively simple, retraining\non query distribution enables the draft model to more accurately predict the\ntarget model's outputs, particularly on data originating from query\ndistributions. As the draft model evolves online, it aligns with the query\ndistribution in real time, mitigating distribution shifts. We develop a\nprototype of online speculative decoding based on online knowledge distillation\nand evaluate it using both synthetic and real query data on several popular\nLLMs. The results show a substantial increase in the token acceptance rate by\n0.1 to 0.65, which translates into 1.22x to 3.06x latency reduction.\n","authors":["Xiaoxuan Liu","Lanxiang Hu","Peter Bailis","Ion Stoica","Zhijie Deng","Alvin Cheung","Hao Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.07177v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11501v1","updated":"2023-10-17T18:00:25Z","published":"2023-10-17T18:00:25Z","title":"CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations","summary":" Recent work has aimed to capture nuances of human behavior by using LLMs to\nsimulate responses from particular demographics in settings like social science\nexperiments and public opinion surveys. However, there are currently no\nestablished ways to discuss or evaluate the quality of such LLM simulations.\nMoreover, there is growing concern that these LLM simulations are flattened\ncaricatures of the personas that they aim to simulate, failing to capture the\nmultidimensionality of people and perpetuating stereotypes. To bridge these\ngaps, we present CoMPosT, a framework to characterize LLM simulations using\nfour dimensions: Context, Model, Persona, and Topic. We use this framework to\nmeasure open-ended LLM simulations' susceptibility to caricature, defined via\ntwo criteria: individuation and exaggeration. We evaluate the level of\ncaricature in scenarios from existing work on LLM simulations. We find that for\nGPT-4, simulations of certain demographics (political and marginalized groups)\nand topics (general, uncontroversial) are highly susceptible to caricature.\n","authors":["Myra Cheng","Tiziano Piccardi","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.11501v1.pdf","comment":"To appear at EMNLP 2023 (Main)"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.11449v1","updated":"2023-10-17T17:58:00Z","published":"2023-10-17T17:58:00Z","title":"DELIFFAS: Deformable Light Fields for Fast Avatar Synthesis","summary":" Generating controllable and photorealistic digital human avatars is a\nlong-standing and important problem in Vision and Graphics. Recent methods have\nshown great progress in terms of either photorealism or inference speed while\nthe combination of the two desired properties still remains unsolved. To this\nend, we propose a novel method, called DELIFFAS, which parameterizes the\nappearance of the human as a surface light field that is attached to a\ncontrollable and deforming human mesh model. 
At the core, we represent the\nlight field around the human with a deformable two-surface parameterization,\nwhich enables fast and accurate inference of the human appearance. This allows\nperceptual supervision on the full image compared to previous approaches that\ncould only supervise individual pixels or small patches due to their slow\nruntime. Our carefully designed human representation and supervision strategy\nleads to state-of-the-art synthesis results and inference time. The video\nresults and code are available at\nhttps://vcai.mpi-inf.mpg.de/projects/DELIFFAS.\n","authors":["Youngjoong Kwon","Lingjie Liu","Henry Fuchs","Marc Habermann","Christian Theobalt"],"pdf_url":"https://arxiv.org/pdf/2310.11449v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11448v1","updated":"2023-10-17T17:57:38Z","published":"2023-10-17T17:57:38Z","title":"4K4D: Real-Time 4D View Synthesis at 4K Resolution","summary":" This paper targets high-fidelity and real-time view synthesis of dynamic 3D\nscenes at 4K resolution. Recently, some methods on dynamic view synthesis have\nshown impressive rendering quality. However, their speed is still limited when\nrendering high-resolution images. To overcome this problem, we propose 4K4D, a\n4D point cloud representation that supports hardware rasterization and enables\nunprecedented rendering speed. Our representation is built on a 4D feature grid\nso that the points are naturally regularized and can be robustly optimized. In\naddition, we design a novel hybrid appearance model that significantly boosts\nthe rendering quality while preserving efficiency. Moreover, we develop a\ndifferentiable depth peeling algorithm to effectively learn the proposed model\nfrom RGB videos. Experiments show that our representation can be rendered at\nover 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the\nENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x\nfaster than previous methods and achieves the state-of-the-art rendering\nquality. We will release the code for reproducibility.\n","authors":["Zhen Xu","Sida Peng","Haotong Lin","Guangzhao He","Jiaming Sun","Yujun Shen","Hujun Bao","Xiaowei Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.11448v1.pdf","comment":"Project Page: https://zju3dv.github.io/4k4d"},{"id":"http://arxiv.org/abs/2309.16646v2","updated":"2023-10-17T17:54:37Z","published":"2023-09-28T17:51:05Z","title":"Improving Equivariance in State-of-the-Art Supervised Depth and Normal\n Predictors","summary":" Dense depth and surface normal predictors should possess the equivariant\nproperty to cropping-and-resizing -- cropping the input image should result in\ncropping the same output image. However, we find that state-of-the-art depth\nand normal predictors, despite having strong performances, surprisingly do not\nrespect equivariance. The problem exists even when crop-and-resize data\naugmentation is employed during training. To remedy this, we propose an\nequivariant regularization technique, consisting of an averaging procedure and\na self-consistency loss, to explicitly promote cropping-and-resizing\nequivariance in depth and normal networks. Our approach can be applied to both\nCNN and Transformer architectures, does not incur extra cost during testing,\nand notably improves the supervised and semi-supervised learning performance of\ndense predictors on Taskonomy tasks. 
Finally, finetuning with our loss on\nunlabeled images improves not only equivariance but also accuracy of\nstate-of-the-art depth and normal predictors when evaluated on NYU-v2. GitHub\nlink: https://github.com/mikuhatsune/equivariance\n","authors":["Yuanyi Zhong","Anand Bhattad","Yu-Xiong Wang","David Forsyth"],"pdf_url":"https://arxiv.org/pdf/2309.16646v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2310.11441v1","updated":"2023-10-17T17:51:31Z","published":"2023-10-17T17:51:31Z","title":"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V","summary":" We present Set-of-Mark (SoM), a new visual prompting method, to unleash the\nvisual grounding abilities of large multimodal models (LMMs), such as GPT-4V.\nAs illustrated in Fig. 1 (right), we employ off-the-shelf interactive\nsegmentation models, such as SAM, to partition an image into regions at\ndifferent levels of granularity, and overlay these regions with a set of marks\ne.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can\nanswer the questions that require visual grounding. We perform a comprehensive\nempirical study to validate the effectiveness of SoM on a wide range of\nfine-grained vision and multimodal tasks. For example, our experiments show\nthat GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring\nsegmentation model on RefCOCOg in a zero-shot setting.\n","authors":["Jianwei Yang","Hao Zhang","Feng Li","Xueyan Zou","Chunyuan Li","Jianfeng Gao"],"pdf_url":"https://arxiv.org/pdf/2310.11441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11440v1","updated":"2023-10-17T17:50:46Z","published":"2023-10-17T17:50:46Z","title":"EvalCrafter: Benchmarking and Evaluating Large Video Generation Models","summary":" The vision and language generative models have been overgrown in recent\nyears. For video generation, various open-sourced models and public-available\nservices are released for generating high-visual quality videos. However, these\nmethods often use a few academic metrics, for example, FVD or IS, to evaluate\nthe performance. We argue that it is hard to judge the large conditional\ngenerative models from the simple metrics since these models are often trained\non very large datasets with multi-aspect abilities. Thus, we propose a new\nframework and pipeline to exhaustively evaluate the performance of the\ngenerated videos. To achieve this, we first conduct a new prompt list for\ntext-to-video generation by analyzing the real-world prompt list with the help\nof the large language model. Then, we evaluate the state-of-the-art video\ngenerative models on our carefully designed benchmarks, in terms of visual\nqualities, content qualities, motion qualities, and text-caption alignment with\naround 18 objective metrics. To obtain the final leaderboard of the models, we\nalso fit a series of coefficients to align the objective metrics to the users'\nopinions. 
Based on the proposed opinion alignment method, our final score shows\na higher correlation than simply averaging the metrics, showing the\neffectiveness of the proposed evaluation method.\n","authors":["Yaofang Liu","Xiaodong Cun","Xuebo Liu","Xintao Wang","Yong Zhang","Haoxin Chen","Yang Liu","Tieyong Zeng","Raymond Chan","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2310.11440v1.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2303.09447v2","updated":"2023-10-17T17:46:01Z","published":"2023-03-16T16:23:13Z","title":"Steering Prototypes with Prompt-tuning for Rehearsal-free Continual\n Learning","summary":" In the context of continual learning, prototypes-as representative class\nembeddings-offer advantages in memory conservation and the mitigation of\ncatastrophic forgetting. However, challenges related to semantic drift and\nprototype interference persist. In this study, we introduce the Contrastive\nPrototypical Prompt (CPP) approach. Through task-specific prompt-tuning,\nunderpinned by a contrastive learning objective, we effectively address both\naforementioned challenges. Our evaluations on four challenging\nclass-incremental benchmarks reveal that CPP achieves a significant 4% to 6%\nimprovement over state-of-the-art methods. Importantly, CPP operates without a\nrehearsal buffer and narrows the performance divergence between continual and\noffline joint-learning, suggesting an innovative scheme for Transformer-based\ncontinual learning systems.\n","authors":["Zhuowei Li","Long Zhao","Zizhao Zhang","Han Zhang","Di Liu","Ting Liu","Dimitris N. Metaxas"],"pdf_url":"https://arxiv.org/pdf/2303.09447v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.16482v3","updated":"2023-10-17T17:37:54Z","published":"2022-03-30T17:18:11Z","title":"RFNet-4D++: Joint Object Reconstruction and Flow Estimation from 4D\n Point Clouds with Cross-Attention Spatio-Temporal Features","summary":" Object reconstruction from 3D point clouds has been a long-standing research\nproblem in computer vision and computer graphics, and achieved impressive\nprogress. However, reconstruction from time-varying point clouds (a.k.a. 4D\npoint clouds) is generally overlooked. In this paper, we propose a new network\narchitecture, namely RFNet-4D++, that jointly reconstructs objects and their\nmotion flows from 4D point clouds. The key insight is simultaneously performing\nboth tasks via learning of spatial and temporal features from a sequence of\npoint clouds can leverage individual tasks, leading to improved overall\nperformance. To prove this ability, we design a temporal vector field learning\nmodule using an unsupervised learning approach for flow estimation task,\nleveraged by supervised learning of spatial structures for object\nreconstruction. Extensive experiments and analyses on benchmark datasets\nvalidated the effectiveness and efficiency of our method. As shown in\nexperimental results, our method achieves state-of-the-art performance on both\nflow estimation and object reconstruction while performing much faster than\nexisting methods in both training and inference. 
Our code and data are\navailable at https://github.com/hkust-vgd/RFNet-4D\n","authors":["Tuan-Anh Vu","Duc Thanh Nguyen","Binh-Son Hua","Quang-Hieu Pham","Sai-Kit Yeung"],"pdf_url":"https://arxiv.org/pdf/2203.16482v3.pdf","comment":"TPAMI journal extension of ECCV 2022 arXiv:2203.16482"},{"id":"http://arxiv.org/abs/2310.11420v1","updated":"2023-10-17T17:28:03Z","published":"2023-10-17T17:28:03Z","title":"Revisiting Map Relations for Unsupervised Non-Rigid Shape Matching","summary":" We propose a novel unsupervised learning approach for non-rigid 3D shape\nmatching. Our approach improves upon recent state-of-the-art deep functional\nmap methods and can be applied to a broad range of different challenging\nscenarios. Previous deep functional map methods mainly focus on feature\nextraction and aim exclusively at obtaining more expressive features for\nfunctional map computation. However, the importance of the functional map\ncomputation itself is often neglected and the relationship between the\nfunctional map and point-wise map is underexplored. In this paper, we\nsystematically investigate the coupling relationship between the functional map\nfrom the functional map solver and the point-wise map based on feature\nsimilarity. To this end, we propose a self-adaptive functional map solver to\nadjust the functional map regularisation for different shape matching\nscenarios, together with a vertex-wise contrastive loss to obtain more\ndiscriminative features. Using different challenging datasets (including\nnon-isometry, topological noise and partiality), we demonstrate that our method\nsubstantially outperforms previous state-of-the-art methods.\n","authors":["Dongliang Cao","Paul Roetzer","Florian Bernard"],"pdf_url":"https://arxiv.org/pdf/2310.11420v1.pdf","comment":"3DV 2024"},{"id":"http://arxiv.org/abs/2310.11417v1","updated":"2023-10-17T17:25:31Z","published":"2023-10-17T17:25:31Z","title":"VcT: Visual change Transformer for Remote Sensing Image Change Detection","summary":" Existing visual change detectors usually adopt CNNs or Transformers for\nfeature representation learning and focus on learning effective representation\nfor the changed regions between images. Although good performance can be\nobtained by enhancing the features of the changed regions, these works\nare still limited, mainly because they ignore the unchanged\nbackground context information. It is known that one main challenge for change\ndetection is how to obtain the consistent representations for two images\ninvolving different variations, such as spatial variation, sunlight intensity,\netc. In this work, we demonstrate that carefully mining the common background\ninformation provides an important cue to learn the consistent representations\nfor the two images which thus obviously facilitates the visual change detection\nproblem. Based on this observation, we propose a novel Visual change\nTransformer (VcT) model for the visual change detection problem. To be specific, a\nshared backbone network is first used to extract the feature maps for the given\nimage pair. Then, each pixel of the feature map is regarded as a graph node, and a\ngraph neural network is proposed to model the structured information for coarse\nchange map prediction. Top-K reliable tokens can be mined from the map and\nrefined by using the clustering algorithm. 
Then, these reliable tokens are\nenhanced by first utilizing self/cross-attention schemes and then interacting\nwith original features via an anchor-primary attention learning module.\nFinally, the prediction head is proposed to get a more accurate change map.\nExtensive experiments on multiple benchmark datasets validated the\neffectiveness of our proposed VcT model.\n","authors":["Bo Jiang","Zitian Wang","Xixi Wang","Ziyan Zhang","Lan Chen","Xiao Wang","Bin Luo"],"pdf_url":"https://arxiv.org/pdf/2310.11417v1.pdf","comment":"Accepted by IEEE Transactions on Geoscience and Remote Sensing (TGRS)\n 2023"},{"id":"http://arxiv.org/abs/2305.13812v2","updated":"2023-10-17T17:07:29Z","published":"2023-05-23T08:28:38Z","title":"Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for\n Improved Vision-Language Compositionality","summary":" Contrastively trained vision-language models have achieved remarkable\nprogress in vision and language representation learning, leading to\nstate-of-the-art models for various downstream multimodal tasks. However,\nrecent research has highlighted severe limitations of these models in their\nability to perform compositional reasoning over objects, attributes, and\nrelations. Scene graphs have emerged as an effective way to understand images\ncompositionally. These are graph-structured semantic representations of images\nthat contain objects, their attributes, and relations with other objects in a\nscene. In this work, we consider the scene graph parsed from text as a proxy\nfor the image scene graph and propose a graph decomposition and augmentation\nframework along with a coarse-to-fine contrastive learning objective between\nimages and text that aligns sentences of various complexities to the same\nimage. Along with this, we propose novel negative mining techniques in the\nscene graph space for improving attribute binding and relation understanding.\nThrough extensive experiments, we demonstrate the effectiveness of our approach\nthat significantly improves attribute binding, relation understanding,\nsystematic generalization, and productivity on multiple recently proposed\nbenchmarks (For example, improvements upto $18\\%$ for systematic\ngeneralization, $16.5\\%$ for relation understanding over a strong baseline),\nwhile achieving similar or better performance than CLIP on various general\nmultimodal tasks.\n","authors":["Harman Singh","Pengchuan Zhang","Qifan Wang","Mengjiao Wang","Wenhan Xiong","Jingfei Du","Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2305.13812v2.pdf","comment":"EMNLP 2023 (main)"},{"id":"http://arxiv.org/abs/2310.11392v1","updated":"2023-10-17T16:45:47Z","published":"2023-10-17T16:45:47Z","title":"Towards Automatic Satellite Images Captions Generation Using Large\n Language Models","summary":" Automatic image captioning is a promising technique for conveying visual\ninformation using natural language. It can benefit various tasks in satellite\nremote sensing, such as environmental monitoring, resource management, disaster\nmanagement, etc. However, one of the main challenges in this domain is the lack\nof large-scale image-caption datasets, as they require a lot of human expertise\nand effort to create. Recent research on large language models (LLMs) has\ndemonstrated their impressive performance in natural language understanding and\ngeneration tasks. 
Nonetheless, most of them cannot handle images (GPT-3.5,\nFalcon, Claude, etc.), while conventional captioning models pre-trained on\ngeneral ground-view images often fail to produce detailed and accurate captions\nfor aerial images (BLIP, GIT, CM3, CM3Leon, etc.). To address this problem, we\npropose a novel approach: Automatic Remote Sensing Image Captioning (ARSIC) to\nautomatically collect captions for remote sensing images by guiding LLMs to\ndescribe their object annotations. We also present a benchmark model that\nadapts the pre-trained generative image2text model (GIT) to generate\nhigh-quality captions for remote-sensing images. Our evaluation demonstrates\nthe effectiveness of our approach for collecting captions for remote sensing\nimages.\n","authors":["Yingxu He","Qiqi Sun"],"pdf_url":"https://arxiv.org/pdf/2310.11392v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16309v3","updated":"2023-10-17T16:34:46Z","published":"2023-05-25T17:58:14Z","title":"Imitating Task and Motion Planning with Visuomotor Transformers","summary":" Imitation learning is a powerful tool for training robot manipulation\npolicies, allowing them to learn from expert demonstrations without manual\nprogramming or trial-and-error. However, common methods of data collection,\nsuch as human supervision, scale poorly, as they are time-consuming and\nlabor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously\ngenerate large-scale datasets of diverse demonstrations. In this work, we show\nthat the combination of large-scale datasets generated by TAMP supervisors and\nflexible Transformer models to fit them is a powerful paradigm for robot\nmanipulation. To that end, we present a novel imitation learning system called\nOPTIMUS that trains large-scale visuomotor Transformer policies by imitating a\nTAMP agent. OPTIMUS introduces a pipeline for generating TAMP data that is\nspecifically curated for imitation learning and can be used to train performant\ntransformer-based policies. In this paper, we present a thorough study of the\ndesign decisions required to imitate TAMP and demonstrate that OPTIMUS can\nsolve a wide variety of challenging vision-based manipulation tasks with over\n70 different objects, ranging from long-horizon pick-and-place tasks, to shelf\nand articulated object manipulation, achieving 70 to 80% success rates. Video\nresults and code at https://mihdalal.github.io/optimus/\n","authors":["Murtaza Dalal","Ajay Mandlekar","Caelan Garrett","Ankur Handa","Ruslan Salakhutdinov","Dieter Fox"],"pdf_url":"https://arxiv.org/pdf/2305.16309v3.pdf","comment":"Conference on Robot Learning (CoRL) 2023. 8 pages, 5 figures, 2\n tables; 11 pages appendix (10 additional figures)"},{"id":"http://arxiv.org/abs/2310.11385v1","updated":"2023-10-17T16:32:38Z","published":"2023-10-17T16:32:38Z","title":"A voxel-level approach to brain age prediction: A method to assess\n regional brain aging","summary":" Brain aging is a regional phenomenon, a facet that remains relatively\nunder-explored within the realm of brain age prediction research using machine\nlearning methods. Voxel-level predictions can provide localized brain age\nestimates that can provide granular insights into the regional aging processes.\nThis is essential to understand the differences in aging trajectories in\nhealthy versus diseased subjects. In this work, a deep learning-based multitask\nmodel is proposed for voxel-level brain age prediction from T1-weighted\nmagnetic resonance images. 
The proposed model outperforms the models existing\nin the literature and yields valuable clinical insights when applied to both\nhealthy and diseased populations. Regional analysis is performed on the\nvoxel-level brain age predictions to understand aging trajectories of known\nanatomical regions in the brain and show that there exist disparities in\nregional aging trajectories of healthy subjects compared to ones with\nunderlying neurological disorders such as Dementia and more specifically,\nAlzheimer's disease. Our code is available at\nhttps://github.com/nehagianchandani/Voxel-level-brain-age-prediction.\n","authors":["Neha Gianchandani","Mahsa Dibaji","Johanna Ospel","Fernando Vega","Mariana Bento","M. Ethan MacDonald","Roberto Souza"],"pdf_url":"https://arxiv.org/pdf/2310.11385v1.pdf","comment":"27 pages, submitted to MELBA"},{"id":"http://arxiv.org/abs/2302.03397v2","updated":"2023-10-17T16:29:12Z","published":"2023-02-07T11:04:14Z","title":"AniPixel: Towards Animatable Pixel-Aligned Human Avatar","summary":" Although human reconstruction typically results in human-specific avatars,\nrecent 3D scene reconstruction techniques utilizing pixel-aligned features show\npromise in generalizing to new scenes. Applying these techniques to human\navatar reconstruction can result in a volumetric avatar with generalizability\nbut limited animatability due to rendering only being possible for static\nrepresentations. In this paper, we propose AniPixel, a novel animatable and\ngeneralizable human avatar reconstruction method that leverages pixel-aligned\nfeatures for body geometry prediction and RGB color blending. Technically, to\nalign the canonical space with the target space and the observation space, we\npropose a bidirectional neural skinning field based on skeleton-driven\ndeformation to establish the target-to-canonical and canonical-to-observation\ncorrespondences. Then, we disentangle the canonical body geometry into a\nnormalized neutral-sized body and a subject-specific residual for better\ngeneralizability. As the geometry and appearance are closely related, we\nintroduce pixel-aligned features to facilitate the body geometry prediction and\ndetailed surface normals to reinforce the RGB color blending. We also devise a\npose-dependent and view direction-related shading module to represent the local\nillumination variance. Experiments show that AniPixel renders comparable novel\nviews while delivering better novel pose animation results than\nstate-of-the-art methods.\n","authors":["Jinlong Fan","Jing Zhang","Zhi Hou","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2302.03397v2.pdf","comment":"Accepted by MM'23, code will be released at\n https://github.com/loong8888/AniPixel"},{"id":"http://arxiv.org/abs/2301.05065v2","updated":"2023-10-17T16:11:36Z","published":"2023-01-12T15:03:05Z","title":"Toward Building General Foundation Models for Language, Vision, and\n Vision-Language Understanding Tasks","summary":" Foundation models or pre-trained models have substantially improved the\nperformance of various language, vision, and vision-language understanding\ntasks. However, existing foundation models can only perform the best in one\ntype of tasks, namely language, vision, or vision-language. It is still an open\nquestion whether it is possible to construct a foundation model performing the\nbest for all the understanding tasks, which we call a general foundation model.\nIn this paper, we propose a new general foundation model, X-FM (the\nX-Foundation Model). 
X-FM has one language encoder, one vision encoder, and one\nfusion encoder, as well as a new training method. The training method includes\ntwo new techniques for learning X-FM from text, image, and image-text pair\ndata. One is to stop gradients from the vision-language training when learning\nthe language encoder. The other is to leverage the vision-language training to\nguide the learning of the vision encoder. Extensive experiments on benchmark\ndatasets show that X-FM can significantly outperform existing general\nfoundation models and perform better than or comparable to existing foundation\nmodels specifically for language, vision, or vision-language understanding.\nCode and pre-trained models are released at\nhttps://github.com/zhangxinsong-nlp/XFM.\n","authors":["Xinsong Zhang","Yan Zeng","Jipeng Zhang","Hang Li"],"pdf_url":"https://arxiv.org/pdf/2301.05065v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10190v2","updated":"2023-10-17T15:44:32Z","published":"2023-06-16T21:51:04Z","title":"ALP: Action-Aware Embodied Learning for Perception","summary":" Current methods in training and benchmarking vision models exhibit an\nover-reliance on passive, curated datasets. Although models trained on these\ndatasets have shown strong performance in a wide variety of tasks such as\nclassification, detection, and segmentation, they fundamentally are unable to\ngeneralize to an ever-evolving world due to constant out-of-distribution shifts\nof input data. Therefore, instead of training on fixed datasets, can we\napproach learning in a more human-centric and adaptive manner? In this paper,\nwe introduce Action-Aware Embodied Learning for Perception (ALP), an embodied\nlearning framework that incorporates action information into representation\nlearning through a combination of optimizing a reinforcement learning policy\nand an inverse dynamics prediction objective. Our method actively explores in\ncomplex 3D environments to both learn generalizable task-agnostic visual\nrepresentations as well as collect downstream training data. We show that ALP\noutperforms existing baselines in several downstream perception tasks. In\naddition, we show that by training on actively collected data more relevant to\nthe environment and task, our method generalizes more robustly to downstream\ntasks compared to models pre-trained on fixed datasets such as ImageNet.\n","authors":["Xinran Liang","Anthony Han","Wilson Yan","Aditi Raghunathan","Pieter Abbeel"],"pdf_url":"https://arxiv.org/pdf/2306.10190v2.pdf","comment":"project website available at https://xinranliang.github.io/alp/"},{"id":"http://arxiv.org/abs/2310.11346v1","updated":"2023-10-17T15:31:28Z","published":"2023-10-17T15:31:28Z","title":"Towards Generalizable Multi-Camera 3D Object Detection via Perspective\n Debiasing","summary":" Detecting objects in 3D space using multiple cameras, known as Multi-Camera\n3D Object Detection (MC3D-Det), has gained prominence with the advent of\nbird's-eye view (BEV) approaches. However, these methods often struggle when\nfaced with unfamiliar testing environments due to the lack of diverse training\ndata encompassing various viewpoints and environments. To address this, we\npropose a novel method that aligns 3D detection with 2D camera plane results,\nensuring consistent and accurate detections. Our framework, anchored in\nperspective debiasing, helps the learning of features resilient to domain\nshifts. 
In our approach, we render diverse view maps from BEV features and\nrectify the perspective bias of these maps, leveraging implicit foreground\nvolumes to bridge the camera and BEV planes. This two-step process promotes the\nlearning of perspective- and context-independent features, crucial for accurate\nobject detection across varying viewpoints, camera parameters and environment\nconditions. Notably, our model-agnostic approach preserves the original network\nstructure without incurring additional inference costs, facilitating seamless\nintegration across various models and simplifying deployment. Furthermore, we\nalso show our approach achieves satisfactory results in real data when trained\nonly with virtual datasets, eliminating the need for real scene annotations.\nExperimental results on both Domain Generalization (DG) and Unsupervised Domain\nAdaptation (UDA) clearly demonstrate its effectiveness. Our code will be\nreleased.\n","authors":["Hao Lu","Yunpeng Zhang","Qing Lian","Dalong Du","Yingcong Chen"],"pdf_url":"https://arxiv.org/pdf/2310.11346v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11341v1","updated":"2023-10-17T15:24:02Z","published":"2023-10-17T15:24:02Z","title":"Dual Cognitive Architecture: Incorporating Biases and Multi-Memory\n Systems for Lifelong Learning","summary":" Artificial neural networks (ANNs) exhibit a narrow scope of expertise on\nstationary independent data. However, the data in the real world is continuous\nand dynamic, and ANNs must adapt to novel scenarios while also retaining the\nlearned knowledge to become lifelong learners. The ability of humans to excel\nat these tasks can be attributed to multiple factors ranging from cognitive\ncomputational structures, cognitive biases, and the multi-memory systems in the\nbrain. We incorporate key concepts from each of these to design a novel\nframework, Dual Cognitive Architecture (DUCA), which includes multiple\nsub-systems, implicit and explicit knowledge representation dichotomy,\ninductive bias, and a multi-memory system. The inductive bias learner within\nDUCA is instrumental in encoding shape information, effectively countering the\ntendency of ANNs to learn local textures. Simultaneously, the inclusion of a\nsemantic memory submodule facilitates the gradual consolidation of knowledge,\nreplicating the dynamics observed in fast and slow learning systems,\nreminiscent of the principles underpinning the complementary learning system in\nhuman cognition. DUCA shows improvement across different settings and datasets,\nand it also exhibits reduced task recency bias, without the need for extra\ninformation. To further test the versatility of lifelong learning methods on a\nchallenging distribution shift, we introduce a novel domain-incremental dataset\nDN4IL. 
In addition to improving performance on existing benchmarks, DUCA also\ndemonstrates superior performance on this complex dataset.\n","authors":["Shruthi Gowda","Bahram Zonooz","Elahe Arani"],"pdf_url":"https://arxiv.org/pdf/2310.11341v1.pdf","comment":"Published in Transactions on Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2306.06081v3","updated":"2023-10-17T15:20:47Z","published":"2023-05-25T09:04:31Z","title":"CARSO: Blending Adversarial Training and Purification Improves\n Adversarial Robustness","summary":" In this work, we propose a novel adversarial defence mechanism for image\nclassification - CARSO - blending the paradigms of adversarial training and\nadversarial purification in a mutually-beneficial, robustness-enhancing way.\nThe method builds upon an adversarially-trained classifier, and learns to map\nits internal representation associated with a potentially perturbed input onto\na distribution of tentative clean reconstructions. Multiple samples from such\ndistribution are classified by the adversarially-trained model itself, and an\naggregation of its outputs finally constitutes the robust prediction of\ninterest. Experimental evaluation by a well-established benchmark of varied,\nstrong adaptive attacks, across different image datasets and classifier\narchitectures, shows that CARSO is able to defend itself against foreseen and\nunforeseen threats, including adaptive end-to-end attacks devised for\nstochastic defences. Paying a tolerable clean accuracy toll, our method\nimproves by a significant margin the state of the art for CIFAR-10 and\nCIFAR-100 $\\ell_\\infty$ robust classification accuracy against AutoAttack. Code\nand pre-trained models are available at https://github.com/emaballarin/CARSO .\n","authors":["Emanuele Ballarin","Alessio Ansuini","Luca Bortolussi"],"pdf_url":"https://arxiv.org/pdf/2306.06081v3.pdf","comment":"19 pages, 1 figure, 9 tables"},{"id":"http://arxiv.org/abs/2310.11333v1","updated":"2023-10-17T15:12:11Z","published":"2023-10-17T15:12:11Z","title":"Key Point-based Orientation Estimation of Strawberries for Robotic Fruit\n Picking","summary":" Selective robotic harvesting is a promising technological solution to address\nlabour shortages which are affecting modern agriculture in many parts of the\nworld. For an accurate and efficient picking process, a robotic harvester\nrequires the precise location and orientation of the fruit to effectively plan\nthe trajectory of the end effector. The current methods for estimating fruit\norientation employ either complete 3D information which typically requires\nregistration from multiple views or rely on fully-supervised learning\ntechniques, which require difficult-to-obtain manual annotation of the\nreference orientation. In this paper, we introduce a novel key-point-based\nfruit orientation estimation method allowing for the prediction of 3D\norientation from 2D images directly. The proposed technique can work without\nfull 3D orientation annotations but can also exploit such information for\nimproved accuracy. We evaluate our work on two separate datasets of strawberry\nimages obtained from real-world data collection scenarios. Our proposed method\nachieves state-of-the-art performance with an average error as low as\n$8^{\\circ}$, improving predictions by $\\sim30\\%$ compared to previous work\npresented in~\\cite{wagner2021efficient}. 
Furthermore, our method is suited for\nreal-time robotic applications with fast inference times of $\\sim30$ms.\n","authors":["Justin Le Louëdec","Grzegorz Cielniak"],"pdf_url":"https://arxiv.org/pdf/2310.11333v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11320v1","updated":"2023-10-17T14:58:18Z","published":"2023-10-17T14:58:18Z","title":"Towards Generic Semi-Supervised Framework for Volumetric Medical Image\n Segmentation","summary":" Volume-wise labeling in 3D medical images is a time-consuming task that\nrequires expertise. As a result, there is growing interest in using\nsemi-supervised learning (SSL) techniques to train models with limited labeled\ndata. However, the challenges and practical applications extend beyond SSL to\nsettings such as unsupervised domain adaptation (UDA) and semi-supervised\ndomain generalization (SemiDG). This work aims to develop a generic SSL\nframework that can handle all three settings. We identify two main obstacles to\nachieving this goal in the existing SSL framework: 1) the weakness of capturing\ndistribution-invariant features; and 2) the tendency for unlabeled data to be\noverwhelmed by labeled data, leading to over-fitting to the labeled data during\ntraining. To address these issues, we propose an Aggregating & Decoupling\nframework. The aggregating part consists of a Diffusion encoder that constructs\na common knowledge set by extracting distribution-invariant features from\naggregated information from multiple distributions/domains. The decoupling part\nconsists of three decoders that decouple the training process with labeled and\nunlabeled data, thus avoiding over-fitting to labeled data, specific domains\nand classes. We evaluate our proposed framework on four benchmark datasets for\nSSL, Class-imbalanced SSL, UDA and SemiDG. The results showcase notable\nimprovements compared to state-of-the-art methods across all four settings,\nindicating the potential of our framework to tackle more challenging SSL\nscenarios. Code and models are available at:\nhttps://github.com/xmed-lab/GenericSSL.\n","authors":["Haonan Wang","Xiaomeng Li"],"pdf_url":"https://arxiv.org/pdf/2310.11320v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2207.07173v3","updated":"2023-10-17T14:52:13Z","published":"2022-07-14T19:16:56Z","title":"Deep Image Clustering with Contrastive Learning and Multi-scale Graph\n Convolutional Networks","summary":" Deep clustering has shown its promising capability in joint representation\nlearning and clustering via deep neural networks. Despite the significant\nprogress, the existing deep clustering works mostly utilize some\ndistribution-based clustering loss, lacking the ability to unify representation\nlearning and multi-scale structure learning. To address this, this paper\npresents a new deep clustering approach termed image clustering with\ncontrastive learning and multi-scale graph convolutional networks (IcicleGCN),\nwhich bridges the gap between convolutional neural network (CNN) and graph\nconvolutional network (GCN) as well as the gap between contrastive learning and\nmulti-scale structure learning for the deep clustering task. 
Our framework\nconsists of four main modules, namely, the CNN-based backbone, the Instance\nSimilarity Module (ISM), the Joint Cluster Structure Learning and Instance\nreconstruction Module (JC-SLIM), and the Multi-scale GCN module (M-GCN).\nSpecifically, the backbone network with two weight-sharing views is utilized to\nlearn the representations for the two augmented samples (from each image). The\nlearned representations are then fed to ISM and JC-SLIM for joint\ninstance-level and cluster-level contrastive learning, respectively, during\nwhich an auto-encoder in JC-SLIM is also pretrained to serve as a bridge to the\nM-GCN module. Further, to enforce multi-scale neighborhood structure learning,\ntwo streams of GCNs and the auto-encoder are simultaneously trained via (i) the\nlayer-wise interaction with representation fusion and (ii) the joint\nself-adaptive learning. Experiments on multiple image datasets demonstrate the\nsuperior clustering performance of IcicleGCN over the state-of-the-art. The\ncode is available at https://github.com/xuyuankun631/IcicleGCN.\n","authors":["Yuankun Xu","Dong Huang","Chang-Dong Wang","Jian-Huang Lai"],"pdf_url":"https://arxiv.org/pdf/2207.07173v3.pdf","comment":"To appear in the Pattern Recognition journal"},{"id":"http://arxiv.org/abs/2310.11316v1","updated":"2023-10-17T14:48:02Z","published":"2023-10-17T14:48:02Z","title":"MonoSKD: General Distillation Framework for Monocular 3D Object\n Detection via Spearman Correlation Coefficient","summary":" Monocular 3D object detection is an inherently ill-posed problem, as it is\nchallenging to predict accurate 3D localization from a single image. Existing\nmonocular 3D detection knowledge distillation methods usually project the LiDAR\nonto the image plane and train the teacher network accordingly. Transferring\nLiDAR-based model knowledge to RGB-based models is more complex, so a general\ndistillation strategy is needed. To alleviate cross-modal prob-lem, we propose\nMonoSKD, a novel Knowledge Distillation framework for Monocular 3D detection\nbased on Spearman correlation coefficient, to learn the relative correlation\nbetween cross-modal features. Considering the large gap between these features,\nstrict alignment of features may mislead the training, so we propose a looser\nSpearman loss. Furthermore, by selecting appropriate distillation locations and\nremoving redundant modules, our scheme saves more GPU resources and trains\nfaster than existing methods. Extensive experiments are performed to verify the\neffectiveness of our framework on the challenging KITTI 3D object detection\nbenchmark. Our method achieves state-of-the-art performance until submission\nwith no additional inference computational cost. Our codes are available at\nhttps://github.com/Senwang98/MonoSKD\n","authors":["Sen Wang","Jin Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.11316v1.pdf","comment":"Accepted by ECAI 2023"},{"id":"http://arxiv.org/abs/2310.11307v1","updated":"2023-10-17T14:32:49Z","published":"2023-10-17T14:32:49Z","title":"Multi Self-supervised Pre-fine-tuned Transformer Fusion for Better\n Intelligent Transportation Detection","summary":" Intelligent transportation system combines advanced information technology to\nprovide intelligent services such as monitoring, detection, and early warning\nfor modern transportation. Intelligent transportation detection is the\ncornerstone of many intelligent traffic services by identifying task targets\nthrough object detection methods. 
However existing detection methods in\nintelligent transportation are limited by two aspects. First, there is a\ndifference between the model knowledge pre-trained on large-scale datasets and\nthe knowledge required for target task. Second, most detection models follow\nthe pattern of single-source learning, which limits the learning ability. To\naddress these problems, we propose a Multi Self-supervised Pre-fine-tuned\nTransformer Fusion (MSPTF) network, consisting of two steps: unsupervised\npre-fine-tune domain knowledge learning and multi-model fusion target task\nlearning. In the first step, we introduced self-supervised learning methods\ninto transformer model pre-fine-tune which could reduce data costs and\nalleviate the knowledge gap between pre-trained model and target task. In the\nsecond step, we take feature information differences between different model\narchitectures and different pre-fine-tune tasks into account and propose\nMulti-model Semantic Consistency Cross-attention Fusion (MSCCF) network to\ncombine different transformer model features by considering channel semantic\nconsistency and feature vector semantic consistency, which obtain more complete\nand proper fusion features for detection task. We experimented the proposed\nmethod on vehicle recognition dataset and road disease detection dataset and\nachieved 1.1%, 5.5%, 4.2% improvement compared with baseline and 0.7%, 1.8%,\n1.7% compared with sota, which proved the effectiveness of our method.\n","authors":["Juwu Zheng","Jiangtao Ren"],"pdf_url":"https://arxiv.org/pdf/2310.11307v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.11713v5","updated":"2023-10-17T14:19:13Z","published":"2023-02-23T00:33:54Z","title":"Can Pre-trained Vision and Language Models Answer Visual\n Information-Seeking Questions?","summary":" Pre-trained vision and language models have demonstrated state-of-the-art\ncapabilities over existing tasks involving images and texts, including visual\nquestion answering. However, it remains unclear whether these models possess\nthe capability to answer questions that are not only querying visual content\nbut knowledge-intensive and information-seeking. In this study, we introduce\nInfoSeek, a visual question answering dataset tailored for information-seeking\nquestions that cannot be answered with only common sense knowledge. Using\nInfoSeek, we analyze various pre-trained visual question answering models and\ngain insights into their characteristics. Our findings reveal that\nstate-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.)\nface challenges in answering visual information-seeking questions, but\nfine-tuning on the InfoSeek dataset elicits models to use fine-grained\nknowledge that was learned during their pre-training. 
Furthermore, we show that\naccurate visual entity recognition can be used to improve performance on\nInfoSeek by retrieving relevant documents, showing a significant space for\nimprovement.\n","authors":["Yang Chen","Hexiang Hu","Yi Luan","Haitian Sun","Soravit Changpinyo","Alan Ritter","Ming-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2302.11713v5.pdf","comment":"EMNLP 2023 (main conference); Our dataset and evaluation is available\n at https://open-vision-language.github.io/infoseek/"},{"id":"http://arxiv.org/abs/2310.11295v1","updated":"2023-10-17T14:16:42Z","published":"2023-10-17T14:16:42Z","title":"CorrTalk: Correlation Between Hierarchical Speech and Facial Activity\n Variances for 3D Animation","summary":" Speech-driven 3D facial animation is a challenging cross-modal task that has\nattracted growing research interest. During speaking activities, the mouth\ndisplays strong motions, while the other facial regions typically demonstrate\ncomparatively weak activity levels. Existing approaches often simplify the\nprocess by directly mapping single-level speech features to the entire facial\nanimation, which overlook the differences in facial activity intensity leading\nto overly smoothed facial movements. In this study, we propose a novel\nframework, CorrTalk, which effectively establishes the temporal correlation\nbetween hierarchical speech features and facial activities of different\nintensities across distinct regions. A novel facial activity intensity metric\nis defined to distinguish between strong and weak facial activity, obtained by\ncomputing the short-time Fourier transform of facial vertex displacements.\nBased on the variances in facial activity, we propose a dual-branch decoding\nframework to synchronously synthesize strong and weak facial activity, which\nguarantees wider intensity facial animation synthesis. Furthermore, a weighted\nhierarchical feature encoder is proposed to establish temporal correlation\nbetween hierarchical speech features and facial activity at different\nintensities, which ensures lip-sync and plausible facial expressions. Extensive\nqualitatively and quantitatively experiments as well as a user study indicate\nthat our CorrTalk outperforms existing state-of-the-art methods. The source\ncode and supplementary video are publicly available at:\nhttps://zjchu.github.io/projects/CorrTalk/\n","authors":["Zhaojie Chu","Kailing Guo","Xiaofen Xing","Yilin Lan","Bolun Cai","Xiangmin Xu"],"pdf_url":"https://arxiv.org/pdf/2310.11295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11284v1","updated":"2023-10-17T14:06:55Z","published":"2023-10-17T14:06:55Z","title":"Self-Supervised 3D Scene Flow Estimation and Motion Prediction using\n Local Rigidity Prior","summary":" In this article, we investigate self-supervised 3D scene flow estimation and\nclass-agnostic motion prediction on point clouds. A realistic scene can be well\nmodeled as a collection of rigidly moving parts, therefore its scene flow can\nbe represented as a combination of the rigid motion of these individual parts.\nBuilding upon this observation, we propose to generate pseudo scene flow labels\nfor self-supervised learning through piecewise rigid motion estimation, in\nwhich the source point cloud is decomposed into local regions and each region\nis treated as rigid. By rigidly aligning each region with its potential\ncounterpart in the target point cloud, we obtain a region-specific rigid\ntransformation to generate its pseudo flow labels. 
To mitigate the impact of\npotential outliers on label generation, when solving the rigid registration for\neach region, we alternately perform three steps: establishing point\ncorrespondences, measuring the confidence for the correspondences, and updating\nthe rigid transformation based on the correspondences and their confidence. As\na result, confident correspondences will dominate label generation and a\nvalidity mask will be derived for the generated pseudo labels. By using the\npseudo labels together with their validity mask for supervision, models can be\ntrained in a self-supervised manner. Extensive experiments on FlyingThings3D\nand KITTI datasets demonstrate that our method achieves new state-of-the-art\nperformance in self-supervised scene flow learning, without any ground truth\nscene flow for supervision, even performing better than some supervised\ncounterparts. Additionally, our method is further extended to class-agnostic\nmotion prediction and significantly outperforms previous state-of-the-art\nself-supervised methods on nuScenes dataset.\n","authors":["Ruibo Li","Chi Zhang","Zhe Wang","Chunhua Shen","Guosheng Lin"],"pdf_url":"https://arxiv.org/pdf/2310.11284v1.pdf","comment":"An extension of our CVPR 2022 paper (RigidFlow: Self-Supervised Scene\n Flow Learning on Point Clouds by Local Rigidity Prior)"},{"id":"http://arxiv.org/abs/2310.11276v1","updated":"2023-10-17T13:55:43Z","published":"2023-10-17T13:55:43Z","title":"Video Super-Resolution Using a Grouped Residual in Residual Network","summary":" Super-resolution (SR) is the technique of increasing the nominal resolution\nof image / video content accompanied with quality improvement. Video\nsuper-resolution (VSR) can be considered as the generalization of single image\nsuper-resolution (SISR). This generalization should be such that more detail is\ncreated in the output using adjacent input frames. In this paper, we propose a\ngrouped residual in residual network (GRRN) for VSR. By adjusting the\nhyperparameters of the proposed structure, we train three networks with\ndifferent numbers of parameters and compare their quantitative and qualitative\nresults with the existing methods. Although based on some quantitative\ncriteria, GRRN does not provide better results than the existing methods, in\nterms of the quality of the output image it has acceptable performance.\n","authors":["MohammadHossein Ashoori","Arash Amini"],"pdf_url":"https://arxiv.org/pdf/2310.11276v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.03737v3","updated":"2023-10-17T13:43:34Z","published":"2022-09-07T13:08:38Z","title":"CGAN-ECT: Tomography Image Reconstruction from Electrical Capacitance\n Measurements Using CGANs","summary":" Due to the rapid growth of Electrical Capacitance Tomography (ECT)\napplications in several industrial fields, there is a crucial need for\ndeveloping high quality, yet fast, methodologies of image reconstruction from\nraw capacitance measurements. Deep learning, as an effective non-linear mapping\ntool for complicated functions, has been going viral in many fields including\nelectrical tomography. In this paper, we propose a Conditional Generative\nAdversarial Network (CGAN) model for reconstructing ECT images from capacitance\nmeasurements. The initial image of the CGAN model is constructed from the\ncapacitance measurement. To our knowledge, this is the first time to represent\nthe capacitance measurements in an image form. 
We have created a new massive\nECT dataset of 320K synthetic image measurements pairs for training, and\ntesting the proposed model. The feasibility and generalization ability of the\nproposed CGAN-ECT model are evaluated using testing dataset, contaminated data\nand flow patterns that are not exposed to the model during the training phase.\nThe evaluation results prove that the proposed CGAN-ECT model can efficiently\ncreate more accurate ECT images than traditional and other deep learning-based\nimage reconstruction algorithms. CGAN-ECT achieved an average image correlation\ncoefficient of more than 99.3% and an average relative image error about 0.07.\n","authors":["Wael Deabes","Alaa E. Abdel-Hakim"],"pdf_url":"https://arxiv.org/pdf/2209.03737v3.pdf","comment":"13 pages, 10 figures, 6 tables"},{"id":"http://arxiv.org/abs/2310.11265v1","updated":"2023-10-17T13:38:38Z","published":"2023-10-17T13:38:38Z","title":"Image Compression using only Attention based Neural Networks","summary":" In recent research, Learned Image Compression has gained prominence for its\ncapacity to outperform traditional handcrafted pipelines, especially at low\nbit-rates. While existing methods incorporate convolutional priors with\noccasional attention blocks to address long-range dependencies, recent advances\nin computer vision advocate for a transformative shift towards fully\ntransformer-based architectures grounded in the attention mechanism. This paper\ninvestigates the feasibility of image compression exclusively using attention\nlayers within our novel model, QPressFormer. We introduce the concept of\nlearned image queries to aggregate patch information via cross-attention,\nfollowed by quantization and coding techniques. Through extensive evaluations,\nour work demonstrates competitive performance achieved by convolution-free\narchitectures across the popular Kodak, DIV2K, and CLIC datasets.\n","authors":["Natacha Luka","Romain Negrel","David Picard"],"pdf_url":"https://arxiv.org/pdf/2310.11265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11257v1","updated":"2023-10-17T13:22:59Z","published":"2023-10-17T13:22:59Z","title":"An empirical study of automatic wildlife detection using drone thermal\n imaging and object detection","summary":" Artificial intelligence has the potential to make valuable contributions to\nwildlife management through cost-effective methods for the collection and\ninterpretation of wildlife data. Recent advances in remotely piloted aircraft\nsystems (RPAS or ``drones'') and thermal imaging technology have created new\napproaches to collect wildlife data. These emerging technologies could provide\npromising alternatives to standard labourious field techniques as well as cover\nmuch larger areas. In this study, we conduct a comprehensive review and\nempirical study of drone-based wildlife detection. Specifically, we collect a\nrealistic dataset of drone-derived wildlife thermal detections. Wildlife\ndetections, including arboreal (for instance, koalas, phascolarctos cinereus)\nand ground dwelling species in our collected data are annotated via bounding\nboxes by experts. We then benchmark state-of-the-art object detection\nalgorithms on our collected dataset. 
We use these experimental results to\nidentify issues and discuss future directions in automatic animal monitoring\nusing drones.\n","authors":["Miao Chang","Tan Vuong","Manas Palaparthi","Lachlan Howell","Alessio Bonti","Mohamed Abdelrazek","Duc Thanh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.11257v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11256v1","updated":"2023-10-17T13:22:36Z","published":"2023-10-17T13:22:36Z","title":"Gromov-Wassertein-like Distances in the Gaussian Mixture Models Space","summary":" In this paper, we introduce two Gromov-Wasserstein-type distances on the set\nof Gaussian mixture models. The first one takes the form of a\nGromov-Wasserstein distance between two discrete distributionson the space of\nGaussian measures. This distance can be used as an alternative to\nGromov-Wasserstein for applications which only require to evaluate how far the\ndistributions are from each other but does not allow to derive directly an\noptimal transportation plan between clouds of points. To design a way to define\nsuch a transportation plan, we introduce another distance between measures\nliving in incomparable spaces that turns out to be closely related to\nGromov-Wasserstein. When restricting the set of admissible transportation\ncouplings to be themselves Gaussian mixture models in this latter, this defines\nanother distance between Gaussian mixture models that can be used as another\nalternative to Gromov-Wasserstein and which allows to derive an optimal\nassignment between points. Finally, we design a transportation plan associated\nwith the first distance by analogy with the second, and we illustrate their\npractical uses on medium-to-large scale problems such as shape matching and\nhyperspectral image color transfer.\n","authors":["Antoine Salmona","Julie Delon","Agnès Desolneux"],"pdf_url":"https://arxiv.org/pdf/2310.11256v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2212.10230v3","updated":"2023-10-17T13:14:03Z","published":"2022-12-20T13:09:58Z","title":"A Comprehensive Study of the Robustness for LiDAR-based 3D Object\n Detectors against Adversarial Attacks","summary":" Recent years have witnessed significant advancements in deep learning-based\n3D object detection, leading to its widespread adoption in numerous\napplications. As 3D object detectors become increasingly crucial for\nsecurity-critical tasks, it is imperative to understand their robustness\nagainst adversarial attacks. This paper presents the first comprehensive\nevaluation and analysis of the robustness of LiDAR-based 3D detectors under\nadversarial attacks. Specifically, we extend three distinct adversarial attacks\nto the 3D object detection task, benchmarking the robustness of\nstate-of-the-art LiDAR-based 3D object detectors against attacks on the KITTI\nand Waymo datasets. We further analyze the relationship between robustness and\ndetector properties. Additionally, we explore the transferability of\ncross-model, cross-task, and cross-data attacks. Thorough experiments on\ndefensive strategies for 3D detectors are conducted, demonstrating that simple\ntransformations like flipping provide little help in improving robustness when\nthe applied transformation strategy is exposed to attackers. 
\\revise{Finally,\nwe propose balanced adversarial focal training, based on conventional\nadversarial training, to strike a balance between accuracy and robustness.} Our\nfindings will facilitate investigations into understanding and defending\nagainst adversarial attacks on LiDAR-based 3D object detectors, thus advancing\nthe field. The source code is publicly available at\n\\url{https://github.com/Eaphan/Robust3DOD}.\n","authors":["Yifan Zhang","Junhui Hou","Yixuan Yuan"],"pdf_url":"https://arxiv.org/pdf/2212.10230v3.pdf","comment":"30 pages, 14 figures. Accepted by IJCV"},{"id":"http://arxiv.org/abs/2310.11239v1","updated":"2023-10-17T13:08:24Z","published":"2023-10-17T13:08:24Z","title":"LiDAR-based 4D Occupancy Completion and Forecasting","summary":" Scene completion and forecasting are two popular perception problems in\nresearch for mobile agents like autonomous vehicles. Existing approaches treat\nthe two problems in isolation, resulting in a separate perception of the two\naspects. In this paper, we introduce a novel LiDAR perception task of Occupancy\nCompletion and Forecasting (OCF) in the context of autonomous driving to unify\nthese aspects into a cohesive framework. This task requires new algorithms to\naddress three challenges altogether: (1) sparse-to-dense reconstruction, (2)\npartial-to-complete hallucination, and (3) 3D-to-4D prediction. To enable\nsupervision and evaluation, we curate a large-scale dataset termed OCFBench\nfrom public autonomous driving datasets. We analyze the performance of closely\nrelated existing baseline models and our own ones on our dataset. We envision\nthat this research will inspire and call for further investigation in this\nevolving and crucial area of 4D perception. Our code for data curation and\nbaseline implementation is available at https://github.com/ai4ce/Occ4cast.\n","authors":["Xinhao Liu","Moonjun Gong","Qi Fang","Haoyu Xie","Yiming Li","Hang Zhao","Chen Feng"],"pdf_url":"https://arxiv.org/pdf/2310.11239v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.00600v2","updated":"2023-10-17T12:48:58Z","published":"2023-06-01T12:16:26Z","title":"Rotating Features for Object Discovery","summary":" The binding problem in human cognition, concerning how the brain represents\nand connects objects within a fixed network of neural connections, remains a\nsubject of intense debate. Most machine learning efforts addressing this issue\nin an unsupervised setting have focused on slot-based methods, which may be\nlimiting due to their discrete nature and difficulty to express uncertainty.\nRecently, the Complex AutoEncoder was proposed as an alternative that learns\ncontinuous and distributed object-centric representations. However, it is only\napplicable to simple toy data. In this paper, we present Rotating Features, a\ngeneralization of complex-valued features to higher dimensions, and a new\nevaluation procedure for extracting objects from distributed representations.\nAdditionally, we show the applicability of our approach to pre-trained\nfeatures. Together, these advancements enable us to scale distributed\nobject-centric representations from simple toy to real-world data. 
We believe\nthis work advances a new paradigm for addressing the binding problem in machine\nlearning and has the potential to inspire further innovation in the field.\n","authors":["Sindy Löwe","Phillip Lippe","Francesco Locatello","Max Welling"],"pdf_url":"https://arxiv.org/pdf/2306.00600v2.pdf","comment":"Oral presentation at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.11217v1","updated":"2023-10-17T12:45:04Z","published":"2023-10-17T12:45:04Z","title":"Innovative Methods for Non-Destructive Inspection of Handwritten\n Documents","summary":" Handwritten document analysis is an area of forensic science, with the goal\nof establishing authorship of documents through examination of inherent\ncharacteristics. Law enforcement agencies use standard protocols based on\nmanual processing of handwritten documents. This method is time-consuming, is\noften subjective in its evaluation, and is not replicable. To overcome these\nlimitations, in this paper we present a framework capable of extracting and\nanalyzing intrinsic measures of manuscript documents related to text line\nheights, space between words, and character sizes using image processing and\ndeep learning techniques. The final feature vector for each document involved\nconsists of the mean and standard deviation for every type of measure\ncollected. By quantifying the Euclidean distance between the feature vectors of\nthe documents to be compared, authorship can be discerned. We also proposed a\nnew and challenging dataset consisting of 362 handwritten manuscripts written\non paper and digital devices by 124 different people. Our study pioneered the\ncomparison between traditionally handwritten documents and those produced with\ndigital tools (e.g., tablets). Experimental results demonstrate the ability of\nour method to objectively determine authorship in different writing media,\noutperforming the state of the art.\n","authors":["Eleonora Breci","Luca Guarnera","Sebastiano Battiato"],"pdf_url":"https://arxiv.org/pdf/2310.11217v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11210v1","updated":"2023-10-17T12:39:16Z","published":"2023-10-17T12:39:16Z","title":"Learning Comprehensive Representations with Richer Self for\n Text-to-Image Person Re-Identification","summary":" Text-to-image person re-identification (TIReID) retrieves pedestrian images\nof the same identity based on a query text. However, existing methods for\nTIReID typically treat it as a one-to-one image-text matching problem, only\nfocusing on the relationship between image-text pairs within a view. The\nmany-to-many matching between image-text pairs across views under the same\nidentity is not taken into account, which is one of the main reasons for the\npoor performance of existing methods. To this end, we propose a simple yet\neffective framework, called LCR$^2$S, for modeling many-to-many correspondences\nof the same identity by learning comprehensive representations for both\nmodalities from a novel perspective. We construct a support set for each image\n(text) by using other images (texts) under the same identity and design a\nmulti-head attentional fusion module to fuse the image (text) and its support\nset. The resulting enriched image and text features fuse information from\nmultiple views, which are aligned to train a \"richer\" TIReID model with\nmany-to-many correspondences. 
Since the support set is unavailable during\ninference, we propose to distill the knowledge learned by the \"richer\" model\ninto a lightweight model for inference with a single image/text as input. The\nlightweight model focuses on semantic association and reasoning of multi-view\ninformation, which can generate a comprehensive representation containing\nmulti-view information with only a single-view input to perform accurate\ntext-to-image retrieval during inference. In particular, we use the intra-modal\nfeatures and inter-modal semantic relations of the \"richer\" model to supervise\nthe lightweight model to inherit its powerful capability. Extensive experiments\ndemonstrate the effectiveness of LCR$^2$S, and it also achieves new\nstate-of-the-art performance on three popular TIReID datasets.\n","authors":["Shuanglin Yan","Neng Dong","Jun Liu","Liyan Zhang","Jinhui Tang"],"pdf_url":"https://arxiv.org/pdf/2310.11210v1.pdf","comment":"Accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2310.11204v1","updated":"2023-10-17T12:30:46Z","published":"2023-10-17T12:30:46Z","title":"Improving Video Deepfake Detection: A DCT-Based Approach with\n Patch-Level Analysis","summary":" The term deepfake refers to all those multimedia contents that were\nsynthetically altered or created from scratch through the use of generative\nmodels. This phenomenon has become widespread due to the use of increasingly\naccurate and efficient architectures capable of rendering manipulated content\nindistinguishable from real content. In order to fight the illicit use of this\npowerful technology, it has become necessary to develop algorithms able to\ndistinguish synthetic content from real ones. In this study, a new algorithm\nfor the detection of deepfakes in digital videos is presented, focusing on the\nmain goal of creating a fast and explainable method from a forensic\nperspective. To achieve this goal, the I-frames were extracted in order to\nprovide faster computation and analysis than approaches described in\nliterature. In addition, to identify the most discriminating regions within\nindividual video frames, the entire frame, background, face, eyes, nose, mouth,\nand face frame were analyzed separately. From the Discrete Cosine Transform\n(DCT), the Beta components were extracted from the AC coefficients and used as\ninput to standard classifiers (e.g., k-NN, SVM, and others) in order to\nidentify those frequencies most discriminative for solving the task in\nquestion. Experimental results obtained on the Faceforensics++ and Celeb-DF\n(v2) datasets show that the eye and mouth regions are those most discriminative\nand able to determine the nature of the video with greater reliability than the\nanalysis of the whole frame. 
The method proposed in this study is analytical,\nfast and does not require much computational power.\n","authors":["Luca Guarnera","Salvatore Manganello","Sebastiano Battiato"],"pdf_url":"https://arxiv.org/pdf/2310.11204v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13785v2","updated":"2023-10-17T12:24:24Z","published":"2023-04-26T19:05:34Z","title":"Customized Segment Anything Model for Medical Image Segmentation","summary":" We propose SAMed, a general solution for medical image segmentation.\nDifferent from the previous methods, SAMed is built upon the large-scale image\nsegmentation model, Segment Anything Model (SAM), to explore the new research\nparadigm of customizing large-scale models for medical image segmentation.\nSAMed applies the low-rank-based (LoRA) finetuning strategy to the SAM image\nencoder and finetunes it together with the prompt encoder and the mask decoder\non labeled medical image segmentation datasets. We also observe the warmup\nfinetuning strategy and the AdamW optimizer lead SAMed to successful\nconvergence and lower loss. Different from SAM, SAMed could perform semantic\nsegmentation on medical images. Our trained SAMed model achieves 81.88 DSC and\n20.64 HD on the Synapse multi-organ segmentation dataset, which is on par with\nthe state-of-the-art methods. We conduct extensive experiments to validate the\neffectiveness of our design. Since SAMed only updates a small fraction of the\nSAM parameters, its deployment cost and storage cost are quite marginal in\npractical usage. The code of SAMed is available at\nhttps://github.com/hitachinsk/SAMed.\n","authors":["Kaidong Zhang","Dong Liu"],"pdf_url":"https://arxiv.org/pdf/2304.13785v2.pdf","comment":"Technical report, 14 pages"},{"id":"http://arxiv.org/abs/2305.11490v4","updated":"2023-10-17T12:16:03Z","published":"2023-05-19T07:44:39Z","title":"LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and\n Generation","summary":" Following the impressive development of LLMs, vision-language alignment in\nLLMs is actively being researched to enable multimodal reasoning and visual IO.\nThis direction of research is particularly relevant to medical imaging because\nmedical image analysis and generation consist of reasoning based on a\ncombination of visual features and prior knowledge. Many recent works have\nfocused on training adapter networks that serve as an information bridge\nbetween image processing networks and LLMs; but presumably, in order to achieve\nmaximum reasoning potential of LLMs on visual information as well, visual and\nlanguage features should be allowed to interact more freely. This is especially\nimportant in the medical domain because understanding and generating medical\nimages such as chest X-rays (CXR) require not only accurate visual and\nlanguage-based reasoning but also a more intimate mapping between the two\nmodalities. 
Thus, taking inspiration from previous work on the transformer and\nVQ-GAN combination for bidirectional image and text generation, we build upon\nthis approach and develop a method for instruction-tuning an LLM pre-trained\nonly on text to gain vision-language capabilities for medical images.\nSpecifically, we leverage a pretrained LLM's existing question-answering and\ninstruction-following abilities to teach it to understand visual inputs by\ninstructing it to answer questions about image inputs and, symmetrically,\noutput both text and image responses appropriate to a given query by tuning the\nLLM with diverse tasks that encompass image-based text-generation and\ntext-based image-generation. We show that our model, LLM-CXR, trained in this\napproach shows better image-text alignment in both CXR understanding and\ngeneration tasks while being smaller in size compared to previously developed\nmodels that perform a narrower range of tasks. The code is at\nhttps://github.com/hyn2028/llm-cxr.\n","authors":["Suhyeon Lee","Won Jun Kim","Jinho Chang","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2305.11490v4.pdf","comment":"20 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.11184v1","updated":"2023-10-17T12:01:32Z","published":"2023-10-17T12:01:32Z","title":"Sparse Multi-Object Render-and-Compare","summary":" Reconstructing 3D shape and pose of static objects from a single image is an\nessential task for various industries, including robotics, augmented reality,\nand digital content creation. This can be done by directly predicting 3D shape\nin various representations or by retrieving CAD models from a database and\npredicting their alignments. Directly predicting 3D shapes often produces\nunrealistic, overly smoothed or tessellated shapes. Retrieving CAD models\nensures realistic shapes but requires robust and accurate alignment. Learning\nto directly predict CAD model poses from image features is challenging and\ninaccurate. Works, such as ROCA, compute poses from predicted normalised object\ncoordinates which can be more accurate but are susceptible to systematic\nfailure. SPARC demonstrates that following a ''render-and-compare'' approach\nwhere a network iteratively improves upon its own predictions achieves accurate\nalignments. Nevertheless, it performs individual CAD alignment for every object\ndetected in an image. This approach is slow when applied to many objects as the\ntime complexity increases linearly with the number of objects and can not learn\ninter-object relations. Introducing a new network architecture Multi-SPARC we\nlearn to perform CAD model alignments for multiple detected objects jointly.\nCompared to other single-view methods we achieve state-of-the-art performance\non the challenging real-world dataset ScanNet. By improving the instance\nalignment accuracy from 31.8% to 40.3% we perform similar to state-of-the-art\nmulti-view methods.\n","authors":["Florian Langer","Ignas Budvytis","Roberto Cipolla"],"pdf_url":"https://arxiv.org/pdf/2310.11184v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11178v1","updated":"2023-10-17T11:53:32Z","published":"2023-10-17T11:53:32Z","title":"FocDepthFormer: Transformer with LSTM for Depth Estimation from Focus","summary":" Depth estimation from focal stacks is a fundamental computer vision problem\nthat aims to infer depth from focus/defocus cues in the image stacks. 
Most\nexisting methods tackle this problem by applying convolutional neural networks\n(CNNs) with 2D or 3D convolutions over a set of fixed stack images to learn\nfeatures across images and stacks. Their performance is restricted due to the\nlocal properties of the CNNs, and they are constrained to process a fixed\nnumber of stacks consistent in train and inference, limiting the generalization\nto the arbitrary length of stacks. To handle the above limitations, we develop\na novel Transformer-based network, FocDepthFormer, composed mainly of a\nTransformer with an LSTM module and a CNN decoder. The self-attention in\nTransformer enables learning more informative features via an implicit\nnon-local cross reference. The LSTM module is learned to integrate the\nrepresentations across the stack with arbitrary images. To directly capture the\nlow-level features of various degrees of focus/defocus, we propose to use\nmulti-scale convolutional kernels in an early-stage encoder. Benefiting from\nthe design with LSTM, our FocDepthFormer can be pre-trained with abundant\nmonocular RGB depth estimation data for visual pattern capturing, alleviating\nthe demand for the hard-to-collect focal stack data. Extensive experiments on\nvarious focal stack benchmark datasets show that our model outperforms the\nstate-of-the-art models on multiple metrics.\n","authors":["Xueyang Kang","Fengze Han","Abdur Fayjie","Dong Gong"],"pdf_url":"https://arxiv.org/pdf/2310.11178v1.pdf","comment":"20 pages, 18 figures, journal paper"},{"id":"http://arxiv.org/abs/2310.05392v2","updated":"2023-10-17T11:51:56Z","published":"2023-10-09T04:07:35Z","title":"Lightweight Full-Convolutional Siamese Tracker","summary":" Although single object trackers have achieved advanced performance, their\nlarge-scale models make it difficult to apply them on the platforms with\nlimited resources. Moreover, existing lightweight trackers only achieve balance\nbetween 2-3 points in terms of parameters, performance, Flops and FPS. To\nachieve the optimal balance among these points, this paper propose a\nlightweight full-convolutional Siamese tracker called LightFC. LightFC employs\na novel efficient cross-correlation module (ECM) and a novel efficient\nrep-center head (ERH) to enhance the nonlinear expressiveness of the\nconvolutional tracking pipeline. The ECM employs an attention-like module\ndesign, which conducts spatial and channel linear fusion of fused features and\nenhances the nonlinearly of the fused features. Additionally, it references\nsuccessful factors of current lightweight trackers and introduces\nskip-connections and reuse of search area features. The ERH reparameterizes the\nfeature dimensional stage in the standard center head and introduces channel\nattention to optimize the bottleneck of key feature flows. Comprehensive\nexperiments show that LightFC achieves the optimal balance between performance,\nparameters, Flops and FPS. The precision score of LightFC outperforms\nMixFormerV2-S by 3.7 \\% and 6.5 \\% on LaSOT and TNL2K, respectively, while\nusing 5x fewer parameters and 4.6x fewer Flops. Besides, LightFC runs 2x faster\nthan MixFormerV2-S on CPUs. 
Our code and raw results can be found at\nhttps://github.com/LiYunfengLYF/LightFC\n","authors":["Yunfeng Li","Bo Wang","Xueyi Wu","Zhuoyan Liu","Ye Li"],"pdf_url":"https://arxiv.org/pdf/2310.05392v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.15938v2","updated":"2023-10-17T11:50:03Z","published":"2023-03-28T12:49:10Z","title":"fRegGAN with K-space Loss Regularization for Medical Image Translation","summary":" Generative adversarial networks (GANs) have shown remarkable success in\ngenerating realistic images and are increasingly used in medical imaging for\nimage-to-image translation tasks. However, GANs tend to suffer from a frequency\nbias towards low frequencies, which can lead to the removal of important\nstructures in the generated images. To address this issue, we propose a novel\nfrequency-aware image-to-image translation framework based on the supervised\nRegGAN approach, which we call fRegGAN. The framework employs a K-space loss to\nregularize the frequency content of the generated images and incorporates\nwell-known properties of MRI K-space geometry to guide the network training\nprocess. By combine our method with the RegGAN approach, we can mitigate the\neffect of training with misaligned data and frequency bias at the same time. We\nevaluate our method on the public BraTS dataset and outperform the baseline\nmethods in terms of both quantitative and qualitative metrics when synthesizing\nT2-weighted from T1-weighted MR images. Detailed ablation studies are provided\nto understand the effect of each modification on the final performance. The\nproposed method is a step towards improving the performance of image-to-image\ntranslation and synthesis in the medical domain and shows promise for other\napplications in the field of image processing and generation.\n","authors":["Ivo M. Baltruschat","Felix Kreis","Alexander Hoelscher","Melanie Dohmen","Matthias Lenga"],"pdf_url":"https://arxiv.org/pdf/2303.15938v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11173v1","updated":"2023-10-17T11:41:38Z","published":"2023-10-17T11:41:38Z","title":"Knowledge Extraction and Distillation from Large-Scale Image-Text\n Colonoscopy Records Leveraging Large Language and Vision Models","summary":" The development of artificial intelligence systems for colonoscopy analysis\noften necessitates expert-annotated image datasets. However, limitations in\ndataset size and diversity impede model performance and generalisation.\nImage-text colonoscopy records from routine clinical practice, comprising\nmillions of images and text reports, serve as a valuable data source, though\nannotating them is labour-intensive. Here we leverage recent advancements in\nlarge language and vision models and propose EndoKED, a data mining paradigm\nfor deep knowledge extraction and distillation. EndoKED automates the\ntransformation of raw colonoscopy records into image datasets with pixel-level\nannotation. We validate EndoKED using multi-centre datasets of raw colonoscopy\nrecords (~1 million images), demonstrating its superior performance in training\npolyp detection and segmentation models. 
Furthermore, the EndoKED pre-trained\nvision backbone enables data-efficient and generalisable learning for optical\nbiopsy, achieving expert-level performance in both retrospective and\nprospective validation.\n","authors":["Shuo Wang","Yan Zhu","Xiaoyuan Luo","Zhiwei Yang","Yizhe Zhang","Peiyao Fu","Manning Wang","Zhijian Song","Quanlin Li","Pinghong Zhou","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2310.11173v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11153v1","updated":"2023-10-17T11:19:51Z","published":"2023-10-17T11:19:51Z","title":"Unsupervised Pre-Training Using Masked Autoencoders for ECG Analysis","summary":" Unsupervised learning methods have become increasingly important in deep\nlearning due to their demonstrated large utilization of datasets and higher\naccuracy in computer vision and natural language processing tasks. There is a\ngrowing trend to extend unsupervised learning methods to other domains, which\nhelps to utilize a large amount of unlabelled data. This paper proposes an\nunsupervised pre-training technique based on masked autoencoder (MAE) for\nelectrocardiogram (ECG) signals. In addition, we propose a task-specific\nfine-tuning to form a complete framework for ECG analysis. The framework is\nhigh-level, universal, and not individually adapted to specific model\narchitectures or tasks. Experiments are conducted using various model\narchitectures and large-scale datasets, resulting in an accuracy of 94.39% on\nthe MITDB dataset for ECG arrhythmia classification task. The result shows a\nbetter performance for the classification of previously unseen data for the\nproposed approach compared to fully supervised methods.\n","authors":["Guoxin Wang","Qingyuan Wang","Ganesh Neelakanta Iyer","Avishek Nag","Deepu John"],"pdf_url":"https://arxiv.org/pdf/2310.11153v1.pdf","comment":"Accepted by IEEE Biomedical Circuits and Systems (BIOCAS) 2023"},{"id":"http://arxiv.org/abs/2209.02397v2","updated":"2023-10-17T11:09:43Z","published":"2022-09-06T11:15:58Z","title":"A Scene-Text Synthesis Engine Achieved Through Learning from Decomposed\n Real-World Data","summary":" Scene-text image synthesis techniques that aim to naturally compose text\ninstances on background scene images are very appealing for training deep\nneural networks due to their ability to provide accurate and comprehensive\nannotation information. Prior studies have explored generating synthetic text\nimages on two-dimensional and three-dimensional surfaces using rules derived\nfrom real-world observations. Some of these studies have proposed generating\nscene-text images through learning; however, owing to the absence of a suitable\ntraining dataset, unsupervised frameworks have been explored to learn from\nexisting real-world data, which might not yield reliable performance. To ease\nthis dilemma and facilitate research on learning-based scene text synthesis, we\nintroduce DecompST, a real-world dataset prepared from some public benchmarks,\ncontaining three types of annotations: quadrilateral-level BBoxes, stroke-level\ntext masks, and text-erased images. Leveraging the DecompST dataset, we propose\na Learning-Based Text Synthesis engine (LBTS) that includes a text location\nproposal network (TLPNet) and a text appearance adaptation network (TAANet).\nTLPNet first predicts the suitable regions for text embedding, after which\nTAANet adaptively adjusts the geometry and color of the text instance to match\nthe background context. 
After training, those networks can be integrated and\nutilized to generate the synthetic dataset for scene text analysis tasks.\nComprehensive experiments were conducted to validate the effectiveness of the\nproposed LBTS along with existing methods, and the experimental results\nindicate the proposed LBTS can generate better pretraining data for scene text\ndetectors.\n","authors":["Zhengmi Tang","Tomo Miyazaki","Shinichiro Omachi"],"pdf_url":"https://arxiv.org/pdf/2209.02397v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11142v1","updated":"2023-10-17T10:45:28Z","published":"2023-10-17T10:45:28Z","title":"BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian\n Inference","summary":" Diffusion models have impressive image generation capability, but low-quality\ngenerations still exist, and their identification remains challenging due to\nthe lack of a proper sample-wise metric. To address this, we propose BayesDiff,\na pixel-wise uncertainty estimator for generations from diffusion models based\non Bayesian inference. In particular, we derive a novel uncertainty iteration\nprinciple to characterize the uncertainty dynamics in diffusion, and leverage\nthe last-layer Laplace approximation for efficient Bayesian inference. The\nestimated pixel-wise uncertainty can not only be aggregated into a sample-wise\nmetric to filter out low-fidelity images but also aids in augmenting successful\ngenerations and rectifying artifacts in failed generations in text-to-image\ntasks. Extensive experiments demonstrate the efficacy of BayesDiff and its\npromise for practical applications.\n","authors":["Siqi Kou","Lei Gan","Dequan Wang","Chongxuan Li","Zhijie Deng"],"pdf_url":"https://arxiv.org/pdf/2310.11142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13035v3","updated":"2023-10-17T10:23:46Z","published":"2023-05-22T13:39:28Z","title":"Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design","summary":" Scaling laws have been recently employed to derive compute-optimal model size\n(number of parameters) for a given compute duration. We advance and refine such\nmethods to infer compute-optimal model shapes, such as width and depth, and\nsuccessfully implement this in vision transformers. Our shape-optimized vision\ntransformer, SoViT, achieves results competitive with models that exceed twice\nits size, despite being pre-trained with an equivalent amount of compute. For\nexample, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSRCV2012,\nsurpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical\nsettings, with also less than half the inference cost. We conduct a thorough\nevaluation across multiple tasks, such as image classification, captioning, VQA\nand zero-shot transfer, demonstrating the effectiveness of our model across a\nbroad range of domains and identifying limitations. Overall, our findings\nchallenge the prevailing approach of blindly scaling up vision models and pave\na path for a more informed scaling.\n","authors":["Ibrahim Alabdulmohsin","Xiaohua Zhai","Alexander Kolesnikov","Lucas Beyer"],"pdf_url":"https://arxiv.org/pdf/2305.13035v3.pdf","comment":"10 pages, 7 figures, 9 tables. 
Version 2: Layout fixes"},{"id":"http://arxiv.org/abs/2310.11117v1","updated":"2023-10-17T10:04:47Z","published":"2023-10-17T10:04:47Z","title":"USDC: Unified Static and Dynamic Compression for Visual Transformer","summary":" Visual Transformers have achieved great success in almost all vision tasks,\nsuch as classification, detection, and so on. However, the model complexity and\nthe inference speed of the visual transformers hinder their deployments in\nindustrial products. Various model compression techniques focus on directly\ncompressing the visual transformers into a smaller one while maintaining the\nmodel performance, however, the performance drops dramatically when the\ncompression ratio is large. Furthermore, several dynamic network techniques\nhave also been applied to dynamically compress the visual transformers to\nobtain input-adaptive efficient sub-structures during the inference stage,\nwhich can achieve a better trade-off between the compression ratio and the\nmodel performance. The upper bound of memory of dynamic models is not reduced\nin the practical deployment since the whole original visual transformer model\nand the additional control gating modules should be loaded onto devices\ntogether for inference. To alleviate two disadvantages of two categories of\nmethods, we propose to unify the static compression and dynamic compression\ntechniques jointly to obtain an input-adaptive compressed model, which can\nfurther better balance the total compression ratios and the model performances.\nMoreover, in practical deployment, the batch sizes of the training and\ninference stage are usually different, which will cause the model inference\nperformance to be worse than the model training performance, which is not\ntouched by all previous dynamic network papers. We propose a sub-group gates\naugmentation technique to solve this performance drop problem. Extensive\nexperiments demonstrate the superiority of our method on various baseline\nvisual transformers such as DeiT, T2T-ViT, and so on.\n","authors":["Huan Yuan","Chao Liao","Jianchao Tan","Peng Yao","Jiyuan Jia","Bin Chen","Chengru Song","Di Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.11117v1.pdf","comment":"This paper was actually finished in 2021"},{"id":"http://arxiv.org/abs/2210.11443v2","updated":"2023-10-17T09:54:20Z","published":"2022-10-20T17:45:22Z","title":"Snapshot of Algebraic Vision","summary":" In this survey article, we present interactions between algebraic geometry\nand computer vision, which have recently come under the header of algebraic\nvision. The subject has given new insights in multiple view geometry and its\napplication to 3D scene reconstruction and carried a host of novel problems and\nideas back into algebraic geometry.\n","authors":["Joe Kileel","Kathlén Kohn"],"pdf_url":"https://arxiv.org/pdf/2210.11443v2.pdf","comment":"v2: incorporated referees' suggestions"},{"id":"http://arxiv.org/abs/2310.10123v2","updated":"2023-10-17T09:54:02Z","published":"2023-10-16T07:00:32Z","title":"AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion","summary":" In this paper, we aim to solve complex real-world image restoration\nsituations, in which, one image may have a variety of unknown degradations. To\nthis end, we propose an all-in-one image restoration framework with latent\ndiffusion (AutoDIR), which can automatically detect and address multiple\nunknown degradations. 
Our framework first utilizes a Blind Image Quality\nAssessment Module (BIQA) to automatically detect and identify the unknown\ndominant image degradation type of the image. Then, an All-in-One Image\nRefinement (AIR) Module handles image restoration for multiple kinds of degradation\nwith the guidance of BIQA. Finally, a Structure Correction Module (SCM) is\nproposed to recover the image details distorted by AIR. Our comprehensive\nevaluation demonstrates that AutoDIR outperforms state-of-the-art approaches by\nachieving superior restoration results while supporting a wider range of tasks.\nNotably, AutoDIR is also the first method to automatically handle real-scenario\nimages with multiple unknown degradations.\n","authors":["Yitong Jiang","Zhaoyang Zhang","Tianfan Xue","Jinwei Gu"],"pdf_url":"https://arxiv.org/pdf/2310.10123v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11112v1","updated":"2023-10-17T09:52:54Z","published":"2023-10-17T09:52:54Z","title":"Super resolution of histopathological frozen sections via deep learning\n preserving tissue structure","summary":" Histopathology plays a pivotal role in medical diagnostics. In contrast to\npreparing permanent sections for histopathology, a time-consuming process,\npreparing frozen sections is significantly faster and can be performed during\nsurgery, where the sample scanning time should be optimized. Super-resolution\ntechniques allow imaging the sample at lower magnification and spare scanning\ntime. In this paper, we present a new approach to super resolution for\nhistopathological frozen sections, with a focus on achieving better distortion\nmeasures, rather than pursuing photorealistic images that may compromise\ncritical diagnostic information. Our deep-learning architecture focuses on\nlearning the error between interpolated images and real images, thereby\ngenerating high-resolution images while preserving critical image details,\nreducing the risk of diagnostic misinterpretation. This is done by leveraging\nthe loss functions in the frequency domain, assigning higher weights to the\nreconstruction of complex, high-frequency components. In comparison to existing\nmethods, we obtained significant improvements in terms of Structural Similarity\nIndex (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as well as recovered\ndetails that were lost in the low-resolution frozen-section images and that affect the\npathologist's clinical decisions. Our approach has great potential for\nproviding more rapid frozen-section imaging, with less scanning, while\npreserving the high resolution in the imaged sample.\n","authors":["Elad Yoshai","Gil Goldinger","Miki Haifler","Natan T. Shaked"],"pdf_url":"https://arxiv.org/pdf/2310.11112v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11106v1","updated":"2023-10-17T09:44:30Z","published":"2023-10-17T09:44:30Z","title":"3D Structure-guided Network for Tooth Alignment in 2D Photograph","summary":" Orthodontics focuses on rectifying misaligned teeth (i.e., malocclusions),\naffecting both masticatory function and aesthetics. However, orthodontic\ntreatment often involves complex, lengthy procedures. As such, generating a 2D\nphotograph depicting aligned teeth prior to orthodontic treatment is crucial\nfor effective dentist-patient communication and, more importantly, for\nencouraging patients to accept orthodontic intervention. 
In this paper, we\npropose a 3D structure-guided tooth alignment network that takes 2D photographs\nas input (e.g., photos captured by smartphones) and aligns the teeth within the\n2D image space to generate an orthodontic comparison photograph featuring\naesthetically pleasing, aligned teeth. Notably, while the process operates\nwithin a 2D image space, our method employs 3D intra-oral scanning models\ncollected in clinics to learn about orthodontic treatment, i.e., projecting the\npre- and post-orthodontic 3D tooth structures onto 2D tooth contours, followed\nby a diffusion model to learn the mapping relationship. Ultimately, the aligned\ntooth contours are leveraged to guide the generation of a 2D photograph with\naesthetically pleasing, aligned teeth and realistic textures. We evaluate our\nnetwork on various facial photographs, demonstrating its exceptional\nperformance and strong applicability within the orthodontic industry.\n","authors":["Yulong Dou","Lanzhuju Mei","Dinggang Shen","Zhiming Cui"],"pdf_url":"https://arxiv.org/pdf/2310.11106v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11105v1","updated":"2023-10-17T09:39:53Z","published":"2023-10-17T09:39:53Z","title":"Generalizability of CNN Architectures for Face Morph Presentation Attack","summary":" Automatic border control systems are widespread in modern airports\nworldwide. Morphing attacks on face biometrics are a serious threat that\nundermines the security and reliability of face recognition systems deployed in\nairports and border controls. Therefore, developing a robust Machine Learning\n(ML) system is necessary to prevent criminals from crossing borders with fake\nidentifications, especially since it has been shown that security officers\ncannot detect morphs better than machines. In this study, we investigate the\ngeneralization power of Convolutional Neural Network (CNN) architectures\nagainst morphing attacks. The investigation utilizes 5 distinct CNNs, namely\nShuffleNet, DenseNet201, VGG16, EfficientNet-B0 and InceptionResNet-v2. Each\nCNN architecture represents a well-known family of CNN models in terms of\nnumber of parameters, architectural design and performance across various\ncomputer vision applications. To ensure robust evaluation, we employ 4\ndifferent datasets (Utrecht, London, Defacto and KurdFace) that contain a\ndiverse range of digital face images which cover variations in ethnicity,\ngender, age, lighting condition and camera setting. One of the fundamental\nconcepts of ML system design is the ability to generalize effectively to\npreviously unseen data; hence, we not only evaluate the performance of CNN\nmodels within individual datasets but also explore their performance across\ncombined datasets and investigate each dataset in the testing phase only.\nExperimental results on more than 8 thousand images (genuine and morph) from\nthe 4 datasets show that InceptionResNet-v2 generalizes better to unseen data\nand outperforms the other 4 CNN models.\n","authors":["Sherko R. HmaSalah","Aras Asaad"],"pdf_url":"https://arxiv.org/pdf/2310.11105v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11093v1","updated":"2023-10-17T09:22:20Z","published":"2023-10-17T09:22:20Z","title":"SODA: Robust Training of Test-Time Data Adaptors","summary":" Adapting models deployed to test distributions can mitigate the performance\ndegradation caused by distribution shifts. However, privacy concerns may render\nmodel parameters inaccessible. 
One promising approach involves utilizing\nzeroth-order optimization (ZOO) to train a data adaptor to adapt the test data\nto fit the deployed models. Nevertheless, the data adaptor trained with ZOO\ntypically brings restricted improvements due to the potential corruption of\ndata features caused by the data adaptor. To address this issue, we revisit ZOO\nin the context of test-time data adaptation. We find that the issue directly\nstems from the unreliable estimation of the gradients used to optimize the data\nadaptor, which is inherently due to the unreliable nature of the pseudo-labels\nassigned to the test data. Based on this observation, we propose\npseudo-label-robust data adaptation (SODA) to improve the performance of data\nadaptation. Specifically, SODA leverages high-confidence predicted labels as\nreliable labels to optimize the data adaptor with ZOO for label prediction. For\ndata with low-confidence predictions, SODA encourages the adaptor to preserve\ndata information to mitigate data corruption. Empirical results indicate that\nSODA can significantly enhance the performance of deployed models in the\npresence of distribution shifts without requiring access to model parameters.\n","authors":["Zige Wang","Yonggang Zhang","Zhen Fang","Long Lan","Wenjing Yang","Bo Han"],"pdf_url":"https://arxiv.org/pdf/2310.11093v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.09483v3","updated":"2023-10-17T09:22:01Z","published":"2022-09-20T05:52:28Z","title":"Diffusion Unit: Interpretable Edge Enhancement and Suppression Learning\n for 3D Point Cloud Segmentation","summary":" 3D point clouds are discrete samples of continuous surfaces which can be used\nfor various applications. However, the lack of true connectivity information,\ni.e., edge information, makes point cloud recognition challenging. Recent\nedge-aware methods incorporate edge modeling into network designs to better\ndescribe local structures. Although these methods show that incorporating edge\ninformation is beneficial, how edge information helps remains unclear, making\nit difficult for users to analyze its usefulness. To shed light on this issue,\nin this study, we propose a new algorithm called Diffusion Unit (DU) that\nhandles edge information in a principled and interpretable manner while\nproviding decent improvement. First, we theoretically show that DU learns to\nperform task-beneficial edge enhancement and suppression. Second, we\nexperimentally observe and verify the edge enhancement and suppression\nbehavior. Third, we empirically demonstrate that this behavior contributes to\nperformance improvement. Extensive experiments and analyses performed on\nchallenging benchmarks verify the effectiveness of DU. Specifically, our method\nachieves state-of-the-art performance in object part segmentation using\nShapeNet part and scene segmentation using S3DIS. Our source code is available\nat https://github.com/martianxiu/DiffusionUnit.\n","authors":["Haoyi Xiu","Xin Liu","Weimin Wang","Kyoung-Sook Kim","Takayuki Shinohara","Qiong Chang","Masashi Matsuoka"],"pdf_url":"https://arxiv.org/pdf/2209.09483v3.pdf","comment":"Neurocomputing"},{"id":"http://arxiv.org/abs/2310.11092v1","updated":"2023-10-17T09:21:29Z","published":"2023-10-17T09:21:29Z","title":"DORec: Decomposed Object Reconstruction Utilizing 2D Self-Supervised\n Features","summary":" Decomposing a target object from a complex background while reconstructing is\nchallenging. 
Most approaches acquire perception of object instances\nthrough the use of manual labels, but the annotation procedure is costly. The\nrecent advancements in 2D self-supervised learning have brought new prospects\nto object-aware representation, yet it remains unclear how to leverage such\nnoisy 2D features for clean decomposition. In this paper, we propose a\nDecomposed Object Reconstruction (DORec) network based on neural implicit\nrepresentations. Our key idea is to transfer 2D self-supervised features into\nmasks of two levels of granularity to supervise the decomposition, including a\nbinary mask to indicate the foreground regions and a K-cluster mask to indicate\nthe semantically similar regions. These two masks are complementary to each\nother and lead to robust decomposition. Experimental results show the\nsuperiority of DORec in segmenting and reconstructing the foreground object on\nvarious datasets.\n","authors":["Jun Wu","Sicheng Li","Sihui Ji","Yue Wang","Rong Xiong","Yiyi Liao"],"pdf_url":"https://arxiv.org/pdf/2310.11092v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.04238v4","updated":"2023-10-17T09:16:06Z","published":"2023-03-07T21:03:48Z","title":"Patch of Invisibility: Naturalistic Physical Black-Box Adversarial\n Attacks on Object Detectors","summary":" Adversarial attacks on deep-learning models have been receiving increased\nattention in recent years. Work in this area has mostly focused on\ngradient-based techniques, so-called ``white-box'' attacks, wherein the\nattacker has access to the targeted model's internal parameters; such an\nassumption is usually unrealistic in the real world. Some attacks additionally\nuse the entire pixel space to fool a given model, which is neither practical\nnor physical (i.e., real-world). On the contrary, we propose herein a direct,\nblack-box, gradient-free method that uses the learned image manifold of a\npretrained generative adversarial network (GAN) to generate naturalistic\nphysical adversarial patches for object detectors. To our knowledge, this is the\nfirst and only method that performs black-box physical attacks directly on\nobject-detection models, resulting in a model-agnostic attack. We show\nthat our proposed method works both digitally and physically. We compared our\napproach against four different black-box attacks with different\nconfigurations. Our approach outperformed all other approaches that were tested\nin our experiments by a large margin.\n","authors":["Raz Lapid","Eylon Mizrahi","Moshe Sipper"],"pdf_url":"https://arxiv.org/pdf/2303.04238v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07184v2","updated":"2023-10-17T09:00:22Z","published":"2023-10-11T04:20:32Z","title":"NeuroInspect: Interpretable Neuron-based Debugging Framework through\n Class-conditional Visualizations","summary":" Although deep learning (DL) has achieved remarkable progress in various\ndomains, DL models are still prone to making mistakes. This issue\nnecessitates effective debugging tools for DL practitioners to interpret the\ndecision-making process within the networks. However, existing debugging\nmethods often demand extra data or adjustments to the decision process,\nlimiting their applicability. To tackle this problem, we present NeuroInspect,\nan interpretable neuron-based debugging framework with three key stages:\ncounterfactual explanations, feature visualizations, and false correlation\nmitigation. 
Our debugging framework first pinpoints neurons responsible for\nmistakes in the network and then visualizes features embedded in the neurons to\nbe human-interpretable. To provide these explanations, we introduce\nCLIP-Illusion, a novel feature visualization method that generates images\nrepresenting features conditioned on classes to examine the connection between\nneurons and the decision layer. We alleviate convoluted explanations of the\nconventional visualization approach by employing class information, thereby\nisolating mixed properties. This process offers more human-interpretable\nexplanations for model errors without altering the trained network or requiring\nadditional data. Furthermore, our framework mitigates false correlations\nlearned from a dataset under a stochastic perspective, modifying decisions for\nthe neurons considered as the main causes. We validate the effectiveness of our\nframework by addressing false correlations and improving inferences for classes\nwith the worst performance in real-world settings. Moreover, we demonstrate\nthat NeuroInspect helps debug the mistakes of DL models through evaluation for\nhuman understanding. The code is openly available at\nhttps://github.com/yeongjoonJu/NeuroInspect.\n","authors":["Yeong-Joon Ju","Ji-Hoon Park","Seong-Whan Lee"],"pdf_url":"https://arxiv.org/pdf/2310.07184v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.15010v3","updated":"2023-10-17T08:57:30Z","published":"2023-06-26T18:49:09Z","title":"Efficient High-Resolution Template Matching with Vector Quantized\n Nearest Neighbour Fields","summary":" Template matching is a fundamental problem in computer vision with\napplications in fields including object detection, image registration, and\nobject tracking. Current methods rely on nearest-neighbour (NN) matching, where\nthe query feature space is converted to NN space by representing each query\npixel with its NN in the template. NN-based methods have been shown to perform\nbetter in occlusions, appearance changes, and non-rigid transformations;\nhowever, they scale poorly with high-resolution data and high feature\ndimensions. We present an NN-based method which efficiently reduces the NN\ncomputations and introduces filtering in the NN fields (NNFs). A vector\nquantization step is introduced before the NN calculation to represent the\ntemplate with $k$ features, and the filter response over the NNFs is used to\ncompare the template and query distributions over the features. We show that\nstate-of-the-art performance is achieved in low-resolution data, and our method\noutperforms previous methods at higher resolution.\n","authors":["Ankit Gupta","Ida-Maria Sintorn"],"pdf_url":"https://arxiv.org/pdf/2306.15010v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11077v1","updated":"2023-10-17T08:51:44Z","published":"2023-10-17T08:51:44Z","title":"United We Stand: Using Epoch-wise Agreement of Ensembles to Combat\n Overfit","summary":" Deep neural networks have become the method of choice for solving many image\nclassification tasks, largely because they can fit very complex functions\ndefined over raw images. The downside of such powerful learners is the danger\nof overfitting the training set, leading to poor generalization, which is\nusually avoided by regularization and \"early stopping\" of the training. In this\npaper, we propose a new deep network ensemble classifier that is very effective\nagainst overfit. 
We begin with the theoretical analysis of a regression model,\nwhose prediction - that the variance among classifiers increases when overfit\noccurs - is demonstrated empirically in deep networks in common use. Guided by\nthese results, we construct a new ensemble-based prediction method designed to\ncombat overfit, where the prediction is determined by the most consensual\nprediction throughout the training. On multiple image and text classification\ndatasets, we show that when regular ensembles suffer from overfit, our method\neliminates the harmful reduction in generalization due to overfit, and often\neven surpasses the performance obtained by early stopping. Our method is easy\nto implement, and can be integrated with any training scheme and architecture,\nwithout additional prior knowledge beyond the training set. Accordingly, it is\na practical and useful tool to overcome overfit.\n","authors":["Uri Stern","Daniel Shwartz","Daphna Weinshall"],"pdf_url":"https://arxiv.org/pdf/2310.11077v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02569v2","updated":"2023-10-17T08:11:15Z","published":"2023-10-04T04:07:37Z","title":"ReForm-Eval: Evaluating Large Vision Language Models via Unified\n Re-Formulation of Task-Oriented Benchmarks","summary":" Recent years have witnessed remarkable progress in the development of large\nvision-language models (LVLMs). Benefiting from the strong language backbones\nand efficient cross-modal alignment strategies, LVLMs exhibit surprising\ncapabilities to perceive visual signals and perform visually grounded\nreasoning. However, the capabilities of LVLMs have not been comprehensively and\nquantitatively evaluated. Most existing multi-modal benchmarks require\ntask-oriented input-output formats, posing great challenges to automatically\nassess the free-form text output of LVLMs. To effectively leverage the\nannotations available in existing benchmarks and reduce the manual effort\nrequired for constructing new benchmarks, we propose to re-formulate existing\nbenchmarks into unified LVLM-compatible formats. Through systematic data\ncollection and reformulation, we present the ReForm-Eval benchmark, offering\nsubstantial data for evaluating various capabilities of LVLMs. Based on\nReForm-Eval, we conduct extensive experiments, thoroughly analyze the strengths\nand weaknesses of existing LVLMs, and identify the underlying factors. Our\nbenchmark and evaluation framework will be open-sourced as a cornerstone for\nadvancing the development of LVLMs.\n","authors":["Zejun Li","Ye Wang","Mengfei Du","Qingwen Liu","Binhao Wu","Jiwen Zhang","Chengxing Zhou","Zhihao Fan","Jie Fu","Jingjing Chen","Xuanjing Huang","Zhongyu Wei"],"pdf_url":"https://arxiv.org/pdf/2310.02569v2.pdf","comment":"38 pages, 11 figures, 24 tables"},{"id":"http://arxiv.org/abs/2306.04607v5","updated":"2023-10-17T07:51:55Z","published":"2023-06-07T17:17:58Z","title":"GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data\n Generation","summary":" Diffusion models have attracted significant attention due to their remarkable\nability to create content and generate data for tasks like image\nclassification. However, the use of diffusion models to generate\nhigh-quality object detection data remains an underexplored area, where not\nonly image-level perceptual quality but also geometric conditions such as\nbounding boxes and camera views are essential. 
Previous studies have utilized\neither copy-paste synthesis or layout-to-image (L2I) generation with\nspecifically designed modules to encode semantic layouts. In this paper, we\npropose GeoDiffusion, a simple framework that can flexibly translate various\ngeometric conditions into text prompts and empower pre-trained text-to-image\n(T2I) diffusion models for high-quality detection data generation. Unlike\nprevious L2I methods, our GeoDiffusion is able to encode not only the bounding\nboxes but also extra geometric conditions such as camera views in self-driving\nscenes. Extensive experiments demonstrate GeoDiffusion outperforms previous L2I\nmethods while maintaining 4x training time faster. To the best of our\nknowledge, this is the first work to adopt diffusion models for layout-to-image\ngeneration with geometric conditions and demonstrate that L2I-generated images\ncan be beneficial for improving the performance of object detectors.\n","authors":["Kai Chen","Enze Xie","Zhe Chen","Yibo Wang","Lanqing Hong","Zhenguo Li","Dit-Yan Yeung"],"pdf_url":"https://arxiv.org/pdf/2306.04607v5.pdf","comment":"Project Page: https://kaichen1998.github.io/projects/geodiffusion/"},{"id":"http://arxiv.org/abs/2306.02602v2","updated":"2023-10-17T07:51:10Z","published":"2023-06-05T05:21:15Z","title":"ReContrast: Domain-Specific Anomaly Detection via Contrastive\n Reconstruction","summary":" Most advanced unsupervised anomaly detection (UAD) methods rely on modeling\nfeature representations of frozen encoder networks pre-trained on large-scale\ndatasets, e.g. ImageNet. However, the features extracted from the encoders that\nare borrowed from natural image domains coincide little with the features\nrequired in the target UAD domain, such as industrial inspection and medical\nimaging. In this paper, we propose a novel epistemic UAD method, namely\nReContrast, which optimizes the entire network to reduce biases towards the\npre-trained image domain and orients the network in the target domain. We start\nwith a feature reconstruction approach that detects anomalies from errors.\nEssentially, the elements of contrastive learning are elegantly embedded in\nfeature reconstruction to prevent the network from training instability,\npattern collapse, and identical shortcut, while simultaneously optimizing both\nthe encoder and decoder on the target domain. To demonstrate our transfer\nability on various image domains, we conduct extensive experiments across two\npopular industrial defect detection benchmarks and three medical image UAD\ntasks, which shows our superiority over current state-of-the-art methods.\n","authors":["Jia Guo","Shuai Lu","Lize Jia","Weihang Zhang","Huiqi Li"],"pdf_url":"https://arxiv.org/pdf/2306.02602v2.pdf","comment":"NeurIPS 2023 Poster"},{"id":"http://arxiv.org/abs/2305.12961v2","updated":"2023-10-17T07:44:33Z","published":"2023-05-22T12:11:07Z","title":"Enhanced Meta Label Correction for Coping with Label Corruption","summary":" Traditional methods for learning with the presence of noisy labels have\nsuccessfully handled datasets with artificially injected noise but still fall\nshort of adequately handling real-world noise. With the increasing use of\nmeta-learning in the diverse fields of machine learning, researchers leveraged\nauxiliary small clean datasets to meta-correct the training labels.\nNonetheless, existing meta-label correction approaches are not fully exploiting\ntheir potential. 
In this study, we propose an Enhanced Meta Label Correction\napproach abbreviated as EMLC for the learning with noisy labels (LNL) problem.\nWe re-examine the meta-learning process and introduce faster and more accurate\nmeta-gradient derivations. We propose a novel teacher architecture tailored\nexplicitly to the LNL problem, equipped with novel training objectives. EMLC\noutperforms prior approaches and achieves state-of-the-art results in all\nstandard benchmarks. Notably, EMLC enhances the previous art on the noisy\nreal-world dataset Clothing1M by $1.52\\%$ while requiring $\\times 0.5$ the time\nper epoch and with much faster convergence of the meta-objective when compared\nto the baseline approach.\n","authors":["Mitchell Keren Taraday","Chaim Baskin"],"pdf_url":"https://arxiv.org/pdf/2305.12961v2.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2310.11050v1","updated":"2023-10-17T07:37:32Z","published":"2023-10-17T07:37:32Z","title":"$k$-$t$ CLAIR: Self-Consistency Guided Multi-Prior Learning for Dynamic\n Parallel MR Image Reconstruction","summary":" Cardiac magnetic resonance imaging (CMR) has been widely used in clinical\npractice for the medical diagnosis of cardiac diseases. However, the long\nacquisition time hinders its development in real-time applications. Here, we\npropose a novel self-consistency guided multi-prior learning framework named\n$k$-$t$ CLAIR to exploit spatiotemporal correlations from highly undersampled\ndata for accelerated dynamic parallel MRI reconstruction. The $k$-$t$ CLAIR\nprogressively reconstructs faithful images by leveraging multiple complementary\npriors learned in the $x$-$t$, $x$-$f$, and $k$-$t$ domains in an iterative\nfashion, as dynamic MRI exhibits high spatiotemporal redundancy. Additionally,\n$k$-$t$ CLAIR incorporates calibration information for prior learning,\nresulting in a more consistent reconstruction. Experimental results on cardiac\ncine and T1W/T2W images demonstrate that $k$-$t$ CLAIR achieves high-quality\ndynamic MR reconstruction in terms of both quantitative and qualitative\nperformance.\n","authors":["Liping Zhang","Weitian Chen"],"pdf_url":"https://arxiv.org/pdf/2310.11050v1.pdf","comment":"12 pages, 3 figures, 4 tables. CMRxRecon Challenge, MICCAI 2023"},{"id":"http://arxiv.org/abs/2310.11040v1","updated":"2023-10-17T07:13:28Z","published":"2023-10-17T07:13:28Z","title":"Co-Learning Semantic-aware Unsupervised Segmentation for Pathological\n Image Registration","summary":" The registration of pathological images plays an important role in medical\napplications. Despite its significance, most researchers in this field\nprimarily focus on the registration of normal tissue into normal tissue. The\nnegative impact of focal tissue, such as the loss of spatial correspondence\ninformation and the abnormal distortion of tissue, are rarely considered. In\nthis paper, we propose GIRNet, a novel unsupervised approach for pathological\nimage registration by incorporating segmentation and inpainting through the\nprinciples of Generation, Inpainting, and Registration (GIR). The registration,\nsegmentation, and inpainting modules are trained simultaneously in a\nco-learning manner so that the segmentation of the focal area and the\nregistration of inpainted pairs can improve collaboratively. Overall, the\nregistration of pathological images is achieved in a completely unsupervised\nlearning framework. 
Experimental results on multiple datasets, including\nMagnetic Resonance Imaging (MRI) of T1 sequences, demonstrate the efficacy of\nour proposed method. Our results show that our method can accurately achieve\nthe registration of pathological images and identify lesions even in\nchallenging imaging modalities. Our unsupervised approach offers a promising\nsolution for the efficient and cost-effective registration of pathological\nimages. Our code is available at\nhttps://github.com/brain-intelligence-lab/GIRNet.\n","authors":["Yang Liu","Shi Gu"],"pdf_url":"https://arxiv.org/pdf/2310.11040v1.pdf","comment":"13 pages, 7 figures, published in Medical Image Computing and\n Computer Assisted Intervention (MICCAI) 2023"},{"id":"http://arxiv.org/abs/2310.11031v1","updated":"2023-10-17T07:01:24Z","published":"2023-10-17T07:01:24Z","title":"Domain Generalization Using Large Pretrained Models with\n Mixture-of-Adapters","summary":" Learning a robust vision model despite large distribution shift is essential\nfor model deployment in real-world settings. In particular, domain generalization\n(DG) algorithms aim to maintain the performance of a trained model on different\ndistributions which were not seen during training. One of the most effective\nmethods has been leveraging the already learned rich knowledge of large\npretrained models. However, naively fine-tuning large models to DG tasks is\noften practically infeasible due to memory limitations, extensive time\nrequirements for training, and the risk of learned knowledge deterioration.\nRecently, parameter-efficient fine-tuning (PEFT) methods have been proposed to\nreduce the high computational cost during training and efficiently adapt large\nmodels to downstream tasks. In this work, for the first time, we find that the\nuse of adapters in PEFT methods not only reduces the high computational cost during\ntraining but also serves as an effective regularizer for DG tasks. Surprisingly,\na naive adapter implementation for large models achieves superior performance on\ncommon datasets. However, in situations of large distribution shifts,\nadditional factors such as the optimal amount of regularization due to the strength\nof distribution shifts should be considered for a sophisticated adapter\nimplementation. To address this, we propose a mixture-of-expert based adapter\nfine-tuning method, dubbed mixture-of-adapters (MoA). Specifically, we\nemploy multiple adapters that have varying capacities, and by using learnable\nrouters, we allocate each token to a proper adapter. By using both PEFT and MoA\nmethods, we effectively alleviate the performance deterioration caused by\ndistribution shifts and achieve state-of-the-art performance on diverse DG\nbenchmarks.\n","authors":["Gyuseong Lee","Wooseok Jang","Jin Hyeon Kim","Jaewoo Jung","Seungryong Kim"],"pdf_url":"https://arxiv.org/pdf/2310.11031v1.pdf","comment":"20 pages, 11 figures"},{"id":"http://arxiv.org/abs/2307.09052v3","updated":"2023-10-17T06:56:51Z","published":"2023-07-18T08:06:14Z","title":"Connections between Operator-splitting Methods and Deep Neural Networks\n with Applications in Image Segmentation","summary":" Deep neural networks are a powerful tool for many tasks. Understanding why they\nare so successful and providing a mathematical explanation is an important\nproblem and has been a popular research direction in recent years. In the\nliterature of mathematical analysis of deep neural networks, much work is\ndedicated to establishing representation theories. 
How to make connections\nbetween deep neural networks and mathematical algorithms is still under\ndevelopment. In this paper, we give an algorithmic explanation for deep neural\nnetworks, especially in their connections with operator splitting. We show that\nwith certain splitting strategies, operator-splitting methods have the same\nstructure as networks. Utilizing this connection and the Potts model for image\nsegmentation, two networks inspired by operator-splitting methods are proposed.\nThe two networks are essentially two operator-splitting algorithms solving the\nPotts model. Numerical experiments are presented to demonstrate the\neffectiveness of the proposed networks.\n","authors":["Hao Liu","Xue-Cheng Tai","Raymond Chan"],"pdf_url":"https://arxiv.org/pdf/2307.09052v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10491v2","updated":"2023-10-17T06:24:02Z","published":"2023-09-19T09:59:08Z","title":"DCPT: Darkness Clue-Prompted Tracking in Nighttime UAVs","summary":" Existing nighttime unmanned aerial vehicle (UAV) trackers follow an\n\"Enhance-then-Track\" architecture - first using a light enhancer to brighten\nthe nighttime video, then employing a daytime tracker to locate the object.\nThis separate enhancement and tracking fails to build an end-to-end trainable\nvision system. To address this, we propose a novel architecture called Darkness\nClue-Prompted Tracking (DCPT) that achieves robust UAV tracking at night by\nefficiently learning to generate darkness clue prompts. Without a separate\nenhancer, DCPT directly encodes anti-dark capabilities into prompts using a\ndarkness clue prompter (DCP). Specifically, DCP iteratively learns emphasizing\nand undermining projections for darkness clues. It then injects these learned\nvisual prompts into a daytime tracker with fixed parameters across transformer\nlayers. Moreover, a gated feature aggregation mechanism enables adaptive fusion\nbetween prompts and between prompts and the base model. Extensive experiments\nshow state-of-the-art performance for DCPT on multiple dark scenario\nbenchmarks. The unified end-to-end learning of enhancement and tracking in DCPT\nenables a more trainable system. The darkness clue prompting efficiently\ninjects anti-dark knowledge without extra modules. Code is available at\n\\href{https://github.com/bearyi26/DCPT}{here}.\n","authors":["Jiawen Zhu","Huayi Tang","Zhi-Qi Cheng","Jun-Yan He","Bin Luo","Shihao Qiu","Shengming Li","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2309.10491v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.02523v2","updated":"2023-10-17T06:13:50Z","published":"2023-10-04T01:47:36Z","title":"A Spatio-Temporal Attention-Based Method for Detecting Student Classroom\n Behaviors","summary":" Accurately detecting student behavior from classroom videos is beneficial for\nanalyzing their classroom status and improving teaching efficiency. However,\nlow accuracy in student classroom behavior detection is a prevalent issue. To\naddress this issue, we propose a Spatio-Temporal Attention-Based Method for\nDetecting Student Classroom Behaviors (BDSTA). Firstly, the SlowFast network is\nused to generate motion and environmental information feature maps from the\nvideo. 
Then, the spatio-temporal attention module is applied to the feature\nmaps, including information aggregation, compression and stimulation processes.\nSubsequently, attention maps in the time, channel and space dimensions are\nobtained, and multi-label behavior classification is performed based on these\nattention maps. To solve the long-tail data problem that exists in student\nclassroom behavior datasets, we use an improved focal loss function to assign\nmore weight to the tail class data during training. Experiments are\nconducted on a self-made student classroom behavior dataset named STSCB.\nCompared with the SlowFast model, the average accuracy of student behavior\nclassification detection improves by 8.94\\% using BDSTA.\n","authors":["Fan Yang"],"pdf_url":"https://arxiv.org/pdf/2310.02523v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11332v2","updated":"2023-10-17T06:10:14Z","published":"2023-03-21T05:50:53Z","title":"Deep Learning for Video-based Person Re-Identification: A Survey","summary":" Video-based person re-identification (video re-ID) has lately attracted\ngrowing attention due to its broad practical applications in various areas,\nsuch as surveillance, smart city, and public safety. Nevertheless, video re-ID\nis quite difficult and remains an ongoing research area due to numerous challenges,\nsuch as viewpoint, occlusion, pose variation, and uncertain video sequences.\nIn the last couple of years, deep learning on video re-ID has continuously\nachieved surprising results on public datasets, with various approaches being\ndeveloped to handle diverse problems in video re-ID. Compared to image-based\nre-ID, video re-ID is much more challenging and complex. To encourage future\nresearch, this paper introduces the first comprehensive review of\nup-to-date advancements in deep learning approaches for video re-ID. It broadly\ncovers three important aspects, including a brief overview of video re-ID methods with their\nlimitations, major milestones with technical challenges, and architectural\ndesign. It offers comparative performance analysis on various available\ndatasets, guidance to improve video re-ID with valuable thoughts, and exciting\nresearch directions.\n","authors":["Khawar Islam"],"pdf_url":"https://arxiv.org/pdf/2303.11332v2.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2203.01536v5","updated":"2023-10-17T06:05:41Z","published":"2022-03-03T06:17:03Z","title":"Recent Advances in Vision Transformer: A Survey and Outlook of Recent\n Work","summary":" Vision Transformers (ViTs) are becoming an increasingly popular and dominant technique\nfor various vision tasks, compared to Convolutional Neural Networks (CNNs). As a\ndemanding technique in computer vision, ViTs have successfully solved\nvarious vision problems while focusing on long-range relationships. In this\npaper, we begin by introducing the fundamental concepts and background of the\nself-attention mechanism. Next, we provide a comprehensive overview of recent\ntop-performing ViT methods, describing them in terms of strengths and weaknesses,\ncomputational cost, and training and testing datasets. We thoroughly\ncompare the performance of various ViT algorithms and most representative CNN\nmethods on popular benchmark datasets. Finally, we explore some limitations\nwith insightful observations and provide further research directions. 
The\nproject page along with the collections of papers are available at\nhttps://github.com/khawar512/ViT-Survey\n","authors":["Khawar Islam"],"pdf_url":"https://arxiv.org/pdf/2203.01536v5.pdf","comment":"Added AAAI 2022 methods and working on ICLR 2022 methods and ICML\n 2022"},{"id":"http://arxiv.org/abs/2305.14014v2","updated":"2023-10-17T05:39:43Z","published":"2023-05-23T12:51:20Z","title":"CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained\n Vision-Language Model","summary":" Pre-trained vision-language models~(VLMs) are the de-facto foundation models\nfor various downstream tasks. However, scene text recognition methods still\nprefer backbones pre-trained on a single modality, namely, the visual modality,\ndespite the potential of VLMs to serve as powerful scene text readers. For\nexample, CLIP can robustly identify regular (horizontal) and irregular\n(rotated, curved, blurred, or occluded) text in images. With such merits, we\ntransform CLIP into a scene text reader and introduce CLIP4STR, a simple yet\neffective STR method built upon image and text encoders of CLIP. It has two\nencoder-decoder branches: a visual branch and a cross-modal branch. The visual\nbranch provides an initial prediction based on the visual feature, and the\ncross-modal branch refines this prediction by addressing the discrepancy\nbetween the visual feature and text semantics. To fully leverage the\ncapabilities of both branches, we design a dual predict-and-refine decoding\nscheme for inference. CLIP4STR achieves new state-of-the-art performance on 11\nSTR benchmarks. Additionally, a comprehensive empirical study is provided to\nenhance the understanding of the adaptation of CLIP to STR. We believe our\nmethod establishes a simple but strong baseline for future STR research with\nVLMs.\n","authors":["Shuai Zhao","Xiaohan Wang","Linchao Zhu","Ruijie Quan","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2305.14014v2.pdf","comment":"Preprint, work in progress"},{"id":"http://arxiv.org/abs/2308.08857v2","updated":"2023-10-17T05:27:05Z","published":"2023-08-17T08:31:11Z","title":"D-IF: Uncertainty-aware Human Digitization via Implicit Distribution\n Field","summary":" Realistic virtual humans play a crucial role in numerous industries, such as\nmetaverse, intelligent healthcare, and self-driving simulation. But creating\nthem on a large scale with high levels of realism remains a challenge. The\nutilization of deep implicit function sparks a new era of image-based 3D\nclothed human reconstruction, enabling pixel-aligned shape recovery with fine\ndetails. Subsequently, the vast majority of works locate the surface by\nregressing the deterministic implicit value for each point. However, should all\npoints be treated equally regardless of their proximity to the surface? In this\npaper, we propose replacing the implicit value with an adaptive uncertainty\ndistribution, to differentiate between points based on their distance to the\nsurface. This simple ``value to distribution'' transition yields significant\nimprovements on nearly all the baselines. Furthermore, qualitative results\ndemonstrate that the models trained using our uncertainty distribution loss,\ncan capture more intricate wrinkles, and realistic limbs. 
Code and models are\navailable for research purposes at https://github.com/psyai-net/D-IF_release.\n","authors":["Xueting Yang","Yihao Luo","Yuliang Xiu","Wei Wang","Hao Xu","Zhaoxin Fan"],"pdf_url":"https://arxiv.org/pdf/2308.08857v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15818v2","updated":"2023-10-17T04:59:54Z","published":"2023-09-27T17:44:18Z","title":"Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video\n Generation","summary":" Significant advancements have been achieved in the realm of large-scale\npre-trained text-to-video Diffusion Models (VDMs). However, previous methods\neither rely solely on pixel-based VDMs, which come with high computational\ncosts, or on latent-based VDMs, which often struggle with precise text-video\nalignment. In this paper, we are the first to propose a hybrid model, dubbed as\nShow-1, which marries pixel-based and latent-based VDMs for text-to-video\ngeneration. Our model first uses pixel-based VDMs to produce a low-resolution\nvideo of strong text-video correlation. After that, we propose a novel expert\ntranslation method that employs the latent-based VDMs to further upsample the\nlow-resolution video to high resolution. Compared to latent VDMs, Show-1 can\nproduce high-quality videos of precise text-video alignment; Compared to pixel\nVDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G\nvs 72G). We also validate our model on standard video generation benchmarks.\nOur code and model weights are publicly available at\nhttps://github.com/showlab/Show-1.\n","authors":["David Junhao Zhang","Jay Zhangjie Wu","Jia-Wei Liu","Rui Zhao","Lingmin Ran","Yuchao Gu","Difei Gao","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2309.15818v2.pdf","comment":"project page is https://showlab.github.io/Show-1"},{"id":"http://arxiv.org/abs/2310.10198v2","updated":"2023-10-17T04:53:30Z","published":"2023-10-16T09:09:02Z","title":"MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete\n Representations","summary":" In this work, we present MoConVQ, a novel unified framework for physics-based\nmotion control leveraging scalable discrete representations. Building upon\nvector quantized variational autoencoders (VQ-VAE) and model-based\nreinforcement learning, our approach effectively learns motion embeddings from\na large, unstructured dataset spanning tens of hours of motion examples. The\nresultant motion representation not only captures diverse motion skills but\nalso offers a robust and intuitive interface for various applications. 
We\ndemonstrate the versatility of MoConVQ through several applications: universal\ntracking control from various motion sources, interactive character control\nwith latent motion representations using supervised learning, physics-based\nmotion generation from natural language descriptions using the GPT framework,\nand, most interestingly, seamless integration with large language models (LLMs)\nwith in-context learning to tackle complex and abstract tasks.\n","authors":["Heyuan Yao","Zhenhua Song","Yuyang Zhou","Tenglong Ao","Baoquan Chen","Libin Liu"],"pdf_url":"https://arxiv.org/pdf/2310.10198v2.pdf","comment":"Project page: https://pku-mocca.github.io/MoConVQ-page/"},{"id":"http://arxiv.org/abs/2302.04867v4","updated":"2023-10-17T04:13:57Z","published":"2023-02-09T18:59:48Z","title":"UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of\n Diffusion Models","summary":" Diffusion probabilistic models (DPMs) have demonstrated a very promising\nability in high-resolution image synthesis. However, sampling from a\npre-trained DPM is time-consuming due to the multiple evaluations of the\ndenoising network, making it more and more important to accelerate the sampling\nof DPMs. Despite recent progress in designing fast samplers, existing methods\nstill cannot generate satisfying images in many applications where fewer steps\n(e.g., $<$10) are favored. In this paper, we develop a unified corrector (UniC)\nthat can be applied after any existing DPM sampler to increase the order of\naccuracy without extra model evaluations, and derive a unified predictor (UniP)\nthat supports arbitrary order as a byproduct. Combining UniP and UniC, we\npropose a unified predictor-corrector framework called UniPC for the fast\nsampling of DPMs, which has a unified analytical form for any order and can\nsignificantly improve the sampling quality over previous methods, especially in\nextremely few steps. We evaluate our methods through extensive experiments\nincluding both unconditional and conditional sampling using pixel-space and\nlatent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional)\nand 7.51 FID on ImageNet 256$\\times$256 (conditional) with only 10 function\nevaluations. Code is available at https://github.com/wl-zhao/UniPC.\n","authors":["Wenliang Zhao","Lujia Bai","Yongming Rao","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2302.04867v4.pdf","comment":"Accepted by NeurIPS 2023. Project page:\n https://unipc.ivg-research.xyz"},{"id":"http://arxiv.org/abs/2310.10975v1","updated":"2023-10-17T03:42:12Z","published":"2023-10-17T03:42:12Z","title":"NICE: Improving Panoptic Narrative Detection and Segmentation with\n Cascading Collaborative Learning","summary":" Panoptic Narrative Detection (PND) and Segmentation (PNS) are two challenging\ntasks that involve identifying and locating multiple targets in an image\naccording to a long narrative description. In this paper, we propose a unified\nand effective framework called NICE that can jointly learn these two panoptic\nnarrative recognition tasks. Existing visual grounding tasks use a two-branch\nparadigm, but applying this directly to PND and PNS can result in prediction\nconflict due to their intrinsic many-to-many alignment property. 
To address\nthis, we introduce two cascading modules based on the barycenter of the mask,\nwhich are Coordinate Guided Aggregation (CGA) and Barycenter Driven\nLocalization (BDL), responsible for segmentation and detection, respectively.\nBy linking PNS and PND in series with the barycenter of segmentation as the\nanchor, our approach naturally aligns the two tasks and allows them to\ncomplement each other for improved performance. Specifically, CGA provides the\nbarycenter as a reference for detection, reducing BDL's reliance on a large\nnumber of candidate boxes. BDL leverages its strong ability to\ndistinguish different instances, which improves the performance of CGA for\nsegmentation. Extensive experiments demonstrate that NICE surpasses all\nexisting methods by a large margin, achieving gains of 4.1% for PND and 2.9% for PNS\nover the state of the art. These results validate the effectiveness of our\nproposed collaborative learning strategy. The project is made\npublicly available at https://github.com/Mr-Neko/NICE.\n","authors":["Haowei Wang","Jiayi Ji","Tianyu Guo","Yilong Yang","Yiyi Zhou","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2310.10975v1.pdf","comment":"18 pages. 9 figures, 9 tables"},{"id":"http://arxiv.org/abs/2310.01596v2","updated":"2023-10-17T03:41:32Z","published":"2023-10-02T19:41:42Z","title":"ImagenHub: Standardizing the evaluation of conditional image generation\n models","summary":" Recently, a myriad of conditional image generation and editing models have\nbeen developed to serve different downstream tasks, including text-to-image\ngeneration, text-guided image editing, subject-driven image generation,\ncontrol-guided image generation, etc. However, we observe huge inconsistencies\nin experimental conditions (datasets, inference, and evaluation metrics) that\nrender fair comparisons difficult. This paper proposes ImagenHub, a\none-stop library to standardize the inference and evaluation of all\nconditional image generation models. Firstly, we define seven prominent tasks\nand curate high-quality evaluation datasets for them. Secondly, we build a\nunified inference pipeline to ensure fair comparison. Thirdly, we design two\nhuman evaluation scores, i.e., Semantic Consistency and Perceptual Quality,\nalong with comprehensive guidelines to evaluate generated images. We train\nexpert raters to evaluate the model outputs based on the proposed metrics. Our\nhuman evaluation achieves high inter-worker agreement, with a Krippendorff's alpha\nabove 0.4 for 76% of the models. We comprehensively evaluated a\ntotal of around 30 models and observed three key takeaways: (1) the existing\nmodels' performance is generally unsatisfying except for Text-guided Image\nGeneration and Subject-driven Image Generation, with 74% of models achieving an\noverall score lower than 0.5. (2) We examined the claims from published papers\nand found that 83% of them hold, with a few exceptions. (3) None of the existing\nautomatic metrics achieves a Spearman's correlation higher than 0.2, except for\nsubject-driven image generation. 
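To make the "barycenter of the mask" anchor used by NICE's cascading modules concrete, here is a minimal NumPy sketch of computing a mask barycenter (center of mass); the toy mask and the way the point would be consumed downstream are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy binary mask standing in for a predicted segmentation; in NICE the
# barycenter of such a mask would serve as the anchor shared between the
# segmentation (CGA) and detection (BDL) branches.
mask = np.zeros((6, 8), dtype=np.float32)
mask[2:5, 3:7] = 1.0

def mask_barycenter(mask):
    """Return the (row, col) center of mass of a binary or soft mask."""
    ys, xs = np.indices(mask.shape)
    total = mask.sum()
    return float((ys * mask).sum() / total), float((xs * mask).sum() / total)

print(mask_barycenter(mask))  # -> (3.0, 4.5) for this toy mask
```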
Moving forward, we will continue our efforts\nto evaluate newly published models and update our leaderboard to keep track of\nthe progress in conditional image generation.\n","authors":["Max Ku","Tianle Li","Kai Zhang","Yujie Lu","Xingyu Fu","Wenwen Zhuang","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2310.01596v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09909v2","updated":"2023-10-17T03:41:09Z","published":"2023-10-15T18:32:27Z","title":"Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for\n Multimodal Medical Diagnosis","summary":" Driven by large foundation models, the development of artificial\nintelligence has witnessed tremendous progress lately, leading to a surge of\ngeneral interest from the public. In this study, we aim to assess the\nperformance of OpenAI's newest model, GPT-4V(ision), specifically in the realm\nof multimodal medical diagnosis. Our evaluation encompasses 17 human body\nsystems, including Central Nervous System, Head and Neck, Cardiac, Chest,\nHematology, Hepatobiliary, Gastrointestinal, Urogenital, Gynecology,\nObstetrics, Breast, Musculoskeletal, Spine, Vascular, Oncology, Trauma, and\nPediatrics, with images taken from 8 modalities used in daily clinical routine,\ne.g., X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI),\nPositron Emission Tomography (PET), Digital Subtraction Angiography (DSA),\nMammography, Ultrasound, and Pathology. We probe GPT-4V's ability on\nmultiple clinical tasks, with or without patient history provided, including\nimaging modality and anatomy recognition, disease diagnosis, report generation,\nand disease localisation.\n Our observations show that, while GPT-4V demonstrates proficiency in\ndistinguishing between medical image modalities and anatomy, it faces\nsignificant challenges in disease diagnosis and generating comprehensive\nreports. These findings underscore that, while large multimodal models have made\nsignificant advancements in computer vision and natural language processing, they\nremain far from ready to effectively support real-world medical\napplications and clinical decision-making.\n All images used in this report can be found at\nhttps://github.com/chaoyi-wu/GPT-4V_Medical_Evaluation.\n","authors":["Chaoyi Wu","Jiayu Lei","Qiaoyu Zheng","Weike Zhao","Weixiong Lin","Xiaoman Zhang","Xiao Zhou","Ziheng Zhao","Ya Zhang","Yanfeng Wang","Weidi Xie"],"pdf_url":"https://arxiv.org/pdf/2310.09909v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10533v2","updated":"2023-10-17T03:37:22Z","published":"2023-10-16T15:54:09Z","title":"Label-efficient Segmentation via Affinity Propagation","summary":" Weakly-supervised segmentation with label-efficient sparse annotations has\nattracted increasing research attention to reduce the cost of the laborious\npixel-wise labeling process, while pairwise affinity modeling techniques\nplay an essential role in this task. Most of the existing approaches focus on\nusing the local appearance kernel to model the neighboring pairwise potentials.\nHowever, such a local operation fails to capture the long-range dependencies\nand ignores the topology of objects. In this work, we formulate affinity\nmodeling as an affinity propagation process, and propose local and global\npairwise affinity terms to generate accurate soft pseudo labels. An efficient\nalgorithm is also developed to significantly reduce the computational cost. The\nproposed approach can be conveniently plugged into existing segmentation\nnetworks. 
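The local appearance kernel mentioned in the affinity-propagation summary above is commonly a Gaussian on the colour difference between neighbouring pixels; the sketch below shows that generic formulation. The bandwidth, neighbourhood offsets, and toy image are assumptions, and the wrap-around at borders introduced by np.roll is ignored for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))   # toy RGB image in [0, 1]
sigma = 0.1                             # assumed colour bandwidth

def local_affinity(image, sigma, dy, dx):
    """Gaussian appearance affinity between each pixel and its (dy, dx) neighbour."""
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    diff2 = ((image - shifted) ** 2).sum(axis=-1)
    return np.exp(-diff2 / (2.0 * sigma ** 2))

# Affinities to the right and bottom neighbours: values near 1 suggest the two
# pixels belong to the same region, which is what lets sparse labels propagate.
a_right = local_affinity(image, sigma, 0, 1)
a_down = local_affinity(image, sigma, 1, 0)
print(a_right.shape, float(a_right.mean()))
```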
Experiments on three typical label-efficient segmentation tasks, i.e.\nbox-supervised instance segmentation, point/scribble-supervised semantic\nsegmentation and CLIP-guided semantic segmentation, demonstrate the superior\nperformance of the proposed approach.\n","authors":["Wentong Li","Yuqian Yuan","Song Wang","Wenyu Liu","Dongqi Tang","Jian Liu","Jianke Zhu","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.10533v2.pdf","comment":"NeurIPS2023 Acceptance. Project\n Page:https://LiWentomng.github.io/apro/. Code:\n https://github.com/CircleRadon/APro"},{"id":"http://arxiv.org/abs/2310.08872v2","updated":"2023-10-17T03:36:26Z","published":"2023-10-13T05:48:42Z","title":"R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image\n Generation","summary":" Recent text-to-image (T2I) diffusion models have achieved remarkable progress\nin generating high-quality images given text-prompts as input. However, these\nmodels fail to convey appropriate spatial composition specified by a layout\ninstruction. In this work, we probe into zero-shot grounded T2I generation with\ndiffusion models, that is, generating images corresponding to the input layout\ninformation without training auxiliary modules or finetuning diffusion models.\nWe propose a Region and Boundary (R&B) aware cross-attention guidance approach\nthat gradually modulates the attention maps of diffusion model during\ngenerative process, and assists the model to synthesize images (1) with high\nfidelity, (2) highly compatible with textual input, and (3) interpreting layout\ninstructions accurately. Specifically, we leverage the discrete sampling to\nbridge the gap between consecutive attention maps and discrete layout\nconstraints, and design a region-aware loss to refine the generative layout\nduring diffusion process. We further propose a boundary-aware loss to\nstrengthen object discriminability within the corresponding regions.\nExperimental results show that our method outperforms existing state-of-the-art\nzero-shot grounded T2I generation methods by a large margin both qualitatively\nand quantitatively on several benchmarks.\n","authors":["Jiayu Xiao","Liang Li","Henglei Lv","Shuhui Wang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08872v2.pdf","comment":"Preprint. Under review. Project page:\n https://sagileo.github.io/Region-and-Boundary"},{"id":"http://arxiv.org/abs/2310.10971v1","updated":"2023-10-17T03:35:27Z","published":"2023-10-17T03:35:27Z","title":"Context-Aware Meta-Learning","summary":" Large Language Models like ChatGPT demonstrate a remarkable capacity to learn\nnew concepts during inference without any fine-tuning. However, visual models\ntrained to detect new objects during inference have been unable to replicate\nthis ability, and instead either perform poorly or require meta-training and/or\nfine-tuning on similar objects. In this work, we propose a meta-learning\nalgorithm that emulates Large Language Models by learning new visual concepts\nduring inference without fine-tuning. Our approach leverages a frozen\npre-trained feature extractor, and analogous to in-context learning, recasts\nmeta-learning as sequence modeling over datapoints with known labels and a test\ndatapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our\napproach -- without meta-training or fine-tuning -- exceeds or matches the\nstate-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks.\n","authors":["Christopher Fifty","Dennis Duan","Ronald G. 
Junkins","Ehsan Amid","Jure Leskovec","Christopher Ré","Sebastian Thrun"],"pdf_url":"https://arxiv.org/pdf/2310.10971v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04264v4","updated":"2023-10-17T03:34:15Z","published":"2023-02-08T18:58:00Z","title":"Nerfstudio: A Modular Framework for Neural Radiance Field Development","summary":" Neural Radiance Fields (NeRF) are a rapidly growing area of research with\nwide-ranging applications in computer vision, graphics, robotics, and more. In\norder to streamline the development and deployment of NeRF research, we propose\na modular PyTorch framework, Nerfstudio. Our framework includes plug-and-play\ncomponents for implementing NeRF-based methods, which make it easy for\nresearchers and practitioners to incorporate NeRF into their projects.\nAdditionally, the modular design enables support for extensive real-time\nvisualization tools, streamlined pipelines for importing captured in-the-wild\ndata, and tools for exporting to video, point cloud and mesh representations.\nThe modularity of Nerfstudio enables the development of Nerfacto, our method\nthat combines components from recent papers to achieve a balance between speed\nand quality, while also remaining flexible to future modifications. To promote\ncommunity-driven development, all associated code and data are made publicly\navailable with open-source licensing at https://nerf.studio.\n","authors":["Matthew Tancik","Ethan Weber","Evonne Ng","Ruilong Li","Brent Yi","Justin Kerr","Terrance Wang","Alexander Kristoffersen","Jake Austin","Kamyar Salahi","Abhik Ahuja","David McAllister","Angjoo Kanazawa"],"pdf_url":"https://arxiv.org/pdf/2302.04264v4.pdf","comment":"Project page at https://nerf.studio"},{"id":"http://arxiv.org/abs/2310.05647v3","updated":"2023-10-17T03:28:55Z","published":"2023-10-09T11:59:11Z","title":"Exploiting Manifold Structured Data Priors for Improved MR\n Fingerprinting Reconstruction","summary":" Estimating tissue parameter maps with high accuracy and precision from highly\nundersampled measurements presents one of the major challenges in MR\nfingerprinting (MRF). Many existing works project the recovered voxel\nfingerprints onto the Bloch manifold to improve reconstruction performance.\nHowever, little research focuses on exploiting the latent manifold structure\npriors among fingerprints. To fill this gap, we propose a novel MRF\nreconstruction framework based on manifold structured data priors. Since it is\ndifficult to directly estimate the fingerprint manifold structure, we model the\ntissue parameters as points on a low-dimensional parameter manifold. We reveal\nthat the fingerprint manifold shares the same intrinsic topology as the\nparameter manifold, although being embedded in different Euclidean spaces. To\nexploit the non-linear and non-local redundancies in MRF data, we divide the\nMRF data into spatial patches, and the similarity measurement among data\npatches can be accurately obtained using the Euclidean distance between the\ncorresponding patches in the parameter manifold. The measured similarity is\nthen used to construct the graph Laplacian operator, which represents the\nfingerprint manifold structure. Thus, the fingerprint manifold structure is\nintroduced in the reconstruction framework by using the low-dimensional\nparameter manifold. Additionally, we incorporate the locally low-rank prior in\nthe reconstruction framework to further utilize the local correlations within\neach patch for improved reconstruction performance. 
We also adopt a\nGPU-accelerated NUFFT library to accelerate reconstruction in non-Cartesian\nsampling scenarios. Experimental results demonstrate that our method\nachieves significantly improved reconstruction performance with reduced\ncomputational time compared to state-of-the-art methods.\n","authors":["Peng Li","Yuping Ji","Yue Hu"],"pdf_url":"https://arxiv.org/pdf/2310.05647v3.pdf","comment":"10 pages, 10 figures, will submit to IEEE Transactions on Medical\n Imaging"},{"id":"http://arxiv.org/abs/2303.11632v2","updated":"2023-10-17T03:26:22Z","published":"2023-03-21T07:00:13Z","title":"An Embarrassingly Simple Approach for Wafer Feature Extraction and\n Defect Pattern Recognition","summary":" Identifying defect patterns in a wafer map during manufacturing is crucial for\nfinding the root cause of the underlying issue and provides valuable insights for\nimproving yield in the foundry. Current methods use deep neural networks\nto identify the defects. These models are generally very large and have\nsignificant inference time, and they require GPU support to operate\nefficiently. These issues make such models unsuitable for on-line prediction in\nthe manufacturing foundry. In this paper, we propose an extremely simple yet\neffective technique to extract features from wafer images. The proposed method\nis extremely fast, intuitive, and non-parametric while being explainable. The\nexperimental results show that the proposed pipeline outperforms conventional\ndeep learning models. Our feature extraction requires no training or\nfine-tuning while preserving the relative shape and location of data points, as\nrevealed by our interpretability analysis.\n","authors":["Nitish Shukla"],"pdf_url":"https://arxiv.org/pdf/2303.11632v2.pdf","comment":"study is not relevant"},{"id":"http://arxiv.org/abs/2303.13827v2","updated":"2023-10-17T03:26:07Z","published":"2023-03-24T06:24:07Z","title":"Efficient Mixed-Type Wafer Defect Pattern Recognition Using Compact\n Deformable Convolutional Transformers","summary":" Manufacturing wafers is an intricate task involving thousands of steps.\nDefect Pattern Recognition (DPR) of wafer maps is crucial for finding the root\ncause of the issue and further improving the yield in the wafer foundry.\nMixed-type DPR is much more complicated than single-type DPR due to\nvaried spatial features, the uncertainty of defects, and the number of defects\npresent. To accurately predict the number of defects as well as the types of\ndefects, we propose a novel compact deformable convolutional transformer (DC\nTransformer). Specifically, DC Transformer focuses on the global features\npresent in the wafer map by virtue of learnable deformable kernels and\nmulti-head attention over these features. The proposed method succinctly\nmodels the internal relationship between the wafer maps and the defects. DC\nTransformer is evaluated on a real dataset containing 38 defect patterns.\nExperimental results show that DC Transformer performs exceptionally well in\nrecognizing both single and mixed-type defects. 
The proposed method outperforms\nthe current state-of-the-art models by a considerable margin.\n","authors":["Nitish Shukla"],"pdf_url":"https://arxiv.org/pdf/2303.13827v2.pdf","comment":"Study is not relevant"},{"id":"http://arxiv.org/abs/2310.10963v1","updated":"2023-10-17T03:25:22Z","published":"2023-10-17T03:25:22Z","title":"MRI brain tumor segmentation using informative feature vectors and\n kernel dictionary learning","summary":" This paper presents a method based on a kernel dictionary learning algorithm\nfor segmenting brain tumor regions in magnetic resonance images (MRI). A set of\nfirst-order and second-order statistical feature vectors is extracted from\npatches of size 3 x 3 around pixels in the brain MRI scans. These feature\nvectors are utilized to train two kernel dictionaries separately for healthy\nand tumorous tissues. To enhance the efficiency of the dictionaries and reduce\ntraining time, a correlation-based sample selection technique is developed to\nidentify the most informative and discriminative subset of feature vectors.\nThis technique aims to improve the performance of the dictionaries by selecting\na subset of feature vectors that provide valuable information for the\nsegmentation task. Subsequently, a linear classifier is utilized to distinguish\nbetween healthy and unhealthy pixels based on the learned dictionaries. The\nresults demonstrate that the proposed method outperforms other existing methods\nin terms of segmentation accuracy and significantly reduces both the time and\nmemory required, resulting in a remarkably fast training process.\n","authors":["Seyedeh Mahya Mousavi","Mohammad Mostafavi"],"pdf_url":"https://arxiv.org/pdf/2310.10963v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10958v1","updated":"2023-10-17T03:11:30Z","published":"2023-10-17T03:11:30Z","title":"Enhancing Deep Neural Network Training Efficiency and Performance\n through Linear Prediction","summary":" Deep neural networks (DNNs) have achieved remarkable success in various\nfields, including computer vision and natural language processing. However,\ntraining an effective DNN model still poses challenges. This paper proposes\na method to optimize the training effectiveness of DNNs, with the goal\nof improving model performance. First, based on the observation that DNN\nparameters follow certain regular patterns during the training process, we identify\nthe potential of parameter prediction for improving training efficiency and performance. Second,\nconsidering the number of DNN model parameters, hardware limitations, and the\nnoise-tolerance characteristics of Stochastic Gradient Descent (SGD), a Parameter Linear\nPrediction (PLP) method is developed to perform DNN parameter prediction. Finally,\nvalidation is carried out on several representative backbones. 
Experimental results show that, compared to normal\ntraining under the same training conditions and number of epochs, employing the\nproposed PLP method yields on average about a 1%\naccuracy improvement and a 0.01 top-1/top-5 error reduction for VGG16, ResNet18,\nand GoogLeNet on the CIFAR-100 dataset, which shows the effectiveness of the\nproposed method on different DNN structures and validates its capacity to\nenhance DNN training efficiency and performance.\n","authors":["Hejie Ying","Mengmeng Song","Yaohong Tang","Shungen Xiao","Zimin Xiao"],"pdf_url":"https://arxiv.org/pdf/2310.10958v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10957v1","updated":"2023-10-17T03:08:35Z","published":"2023-10-17T03:08:35Z","title":"Medical Image Segmentation via Sparse Coding Decoder","summary":" Transformers have achieved significant success in medical image segmentation,\nowing to their capability to capture long-range dependencies. Previous works\nincorporate convolutional layers into the encoder module of transformers,\nthereby enhancing their ability to learn local relationships among pixels.\nHowever, transformers may suffer from limited generalization capabilities and\nreduced robustness, attributed to the insufficient spatial recovery ability of\ntheir decoders. To address this issue, a convolutional sparse vector coding based\ndecoder is proposed, namely the CAScaded multi-layer Convolutional Sparse vector\nCoding DEcoder (CASCSCDE), which represents features extracted by the encoder\nusing sparse vectors. To prove the effectiveness of CASCSCDE, the\nwidely used TransUNet model is chosen for demonstration purposes, and\nCASCSCDE is incorporated into TransUNet to establish the TransCASCSCDE\narchitecture. Our experiments demonstrate that TransUNet with CASCSCDE\nsignificantly enhances performance on the Synapse benchmark, obtaining up to\n3.15\\% and 1.16\\% improvements in DICE and mIoU scores, respectively. CASCSCDE\nopens new ways for constructing decoders based on convolutional sparse vector\ncoding.\n","authors":["Long Zeng","Kaigui Wu"],"pdf_url":"https://arxiv.org/pdf/2310.10957v1.pdf","comment":"8 pages, 1 figures"},{"id":"http://arxiv.org/abs/2310.10951v1","updated":"2023-10-17T02:56:10Z","published":"2023-10-17T02:56:10Z","title":"FusionU-Net: U-Net with Enhanced Skip Connection for Pathology Image\n Segmentation","summary":" In recent years, U-Net and its variants have been widely used in pathology\nimage segmentation tasks. One of the key designs of U-Net is the use of skip\nconnections between the encoder and decoder, which helps to recover detailed\ninformation after upsampling. While most variations of U-Net adopt the original\nskip connection design, there is a semantic gap between the encoder and decoder\nthat can negatively impact model performance. Therefore, it is important to\nreduce this semantic gap before applying the skip connections. To address this\nissue, we propose a new segmentation network called FusionU-Net, which is based\non the U-Net structure and incorporates a fusion module to exchange information\nbetween different skip connections to reduce semantic gaps. Unlike other\nfusion modules in existing networks, ours is based on a two-round fusion design\nthat fully considers the local relevance between adjacent encoder layer outputs\nand the need for bi-directional information exchange across multiple layers. 
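Relating to the Parameter Linear Prediction (PLP) summary above, the core idea of extrapolating each DNN weight along its recent trajectory can be sketched as a simple linear prediction over two checkpoints; the checkpoint interval and the one-step extrapolation rule are illustrative assumptions rather than the paper's exact schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are the same weight tensor saved at two recent checkpoints.
w_prev = rng.normal(size=(128, 64)).astype(np.float32)
w_curr = w_prev + 0.01 * rng.normal(size=(128, 64)).astype(np.float32)

def linear_predict(w_prev, w_curr, steps_ahead=1):
    """Extrapolate parameters along their recent linear trend."""
    return w_curr + steps_ahead * (w_curr - w_prev)

# A training loop would load w_pred into the model and resume SGD from there,
# hoping to skip part of the remaining optimisation trajectory.
w_pred = linear_predict(w_prev, w_curr)
print(float(np.abs(w_pred - w_curr).mean()))
```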
We\nconducted extensive experiments on multiple pathology image datasets to\nevaluate our model and found that FusionU-Net achieves better performance\ncompared to other competing methods. We argue our fusion module is more\neffective than the designs of existing networks, and it could be easily\nembedded into other networks to further enhance the model performance.\n","authors":["Zongyi Li","Hongbing Lyu","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2310.10951v1.pdf","comment":"9 pages, 4 figures and 4 tables"},{"id":"http://arxiv.org/abs/2310.10942v1","updated":"2023-10-17T02:38:09Z","published":"2023-10-17T02:38:09Z","title":"Unanswerable Visual Question Answering","summary":" Teaching Visual Question Answering (VQA) models to abstain from unanswerable\nquestions is indispensable for building a trustworthy AI system. Existing\nstudies, though have explored various aspects of VQA, yet marginally ignored\nthis particular attribute. This paper aims to bridge the research gap by\ncontributing a comprehensive dataset, called UNK-VQA. The dataset is\nspecifically designed to address the challenge of questions that can be\nunanswerable. To this end, we first augment the existing data via deliberate\nperturbations on either the image or question. In specific, we carefully ensure\nthat the question-image semantics remain close to the original unperturbed\ndistribution. By means of this, the identification of unanswerable questions\nbecomes challenging, setting our dataset apart from others that involve mere\nimage replacement. We then extensively evaluate the zero- and few-shot\nperformance of several emerging multi-modal large models and discover\nsignificant limitations of them when applied to our dataset. Additionally, we\nalso propose a straightforward method to tackle these unanswerable questions.\nThis dataset, we believe, will serve as a valuable benchmark for enhancing the\nabstention capability of VQA models, thereby leading to increased\ntrustworthiness of AI systems.\n","authors":["Yanyang Guo","Fangkai Jiao","Zhiqi Shen","Liqiang Nie","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2310.10942v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09759v2","updated":"2023-10-17T02:32:19Z","published":"2023-10-15T07:06:01Z","title":"Prototype-oriented Unsupervised Change Detection for Disaster Management","summary":" Climate change has led to an increased frequency of natural disasters such as\nfloods and cyclones. This emphasizes the importance of effective disaster\nmonitoring. In response, the remote sensing community has explored change\ndetection methods. These methods are primarily categorized into supervised\ntechniques, which yield precise results but come with high labeling costs, and\nunsupervised techniques, which eliminate the need for labeling but involve\nintricate hyperparameter tuning. To address these challenges, we propose a\nnovel unsupervised change detection method named Prototype-oriented\nUnsupervised Change Detection for Disaster Management (PUCD). PUCD captures\nchanges by comparing features from pre-event, post-event, and\nprototype-oriented change synthesis images via a foundational model, and\nrefines results using the Segment Anything Model (SAM). Although PUCD is an\nunsupervised change detection, it does not require complex hyperparameter\ntuning. 
We evaluate PUCD framework on the LEVIR-Extension dataset and the\ndisaster dataset and it achieves state-of-the-art performance compared to other\nmethods on the LEVIR-Extension dataset.\n","authors":["Youngtack Oh","Minseok Seo","Doyi Kim","Junghoon Seo"],"pdf_url":"https://arxiv.org/pdf/2310.09759v2.pdf","comment":"4page, 2 figures"},{"id":"http://arxiv.org/abs/2310.05136v4","updated":"2023-10-17T02:27:52Z","published":"2023-10-08T12:10:44Z","title":"InstructDET: Diversifying Referring Object Detection with Generalized\n Instructions","summary":" We propose InstructDET, a data-centric method for referring object detection\n(ROD) that localizes target objects based on user instructions. While deriving\nfrom referring expressions (REC), the instructions we leverage are greatly\ndiversified to encompass common user intentions related to object detection.\nFor one image, we produce tremendous instructions that refer to every single\nobject and different combinations of multiple objects. Each instruction and its\ncorresponding object bounding boxes (bbxs) constitute one training data pair.\nIn order to encompass common detection expressions, we involve emerging\nvision-language model (VLM) and large language model (LLM) to generate\ninstructions guided by text prompts and object bbxs, as the generalizations of\nfoundation models are effective to produce human-like expressions (e.g.,\ndescribing object property, category, and relationship). We name our\nconstructed dataset as InDET. It contains images, bbxs and generalized\ninstructions that are from foundation models. Our InDET is developed from\nexisting REC datasets and object detection datasets, with the expanding\npotential that any image with object bbxs can be incorporated through using our\nInstructDET method. By using our InDET dataset, we show that a conventional ROD\nmodel surpasses existing methods on standard REC datasets and our InDET test\nset. Our data-centric method InstructDET, with automatic data expansion by\nleveraging foundation models, directs a promising field that ROD can be greatly\ndiversified to execute common object detection instructions.\n","authors":["Ronghao Dang","Jiangyan Feng","Haodong Zhang","Chongjian Ge","Lin Song","Lijun Gong","Chengju Liu","Qijun Chen","Feng Zhu","Rui Zhao","Yibing Song"],"pdf_url":"https://arxiv.org/pdf/2310.05136v4.pdf","comment":"27 pages (include Appendix) Technical Report"},{"id":"http://arxiv.org/abs/2301.05856v2","updated":"2023-10-17T02:07:00Z","published":"2023-01-14T08:32:16Z","title":"EARL: An Elliptical Distribution aided Adaptive Rotation Label\n Assignment for Oriented Object Detection in Remote Sensing Images","summary":" Label assignment is a crucial process in object detection, which\nsignificantly influences the detection performance by determining positive or\nnegative samples during training process. However, existing label assignment\nstrategies barely consider the characteristics of targets in remote sensing\nimages (RSIs) thoroughly, e.g., large variations in scales and aspect ratios,\nleading to insufficient and imbalanced sampling and introducing more\nlow-quality samples, thereby limiting detection performance. To solve the above\nproblems, an Elliptical Distribution aided Adaptive Rotation Label Assignment\n(EARL) is proposed to select high-quality positive samples adaptively in\nanchor-free detectors. 
Specifically, an adaptive scale sampling (ADS) strategy\nis presented to select samples adaptively among multi-level feature maps\naccording to the scales of targets, which achieves sufficient sampling with\nmore balanced scale-level sample distribution. In addition, a dynamic\nelliptical distribution aided sampling (DED) strategy is proposed to make the\nsample distribution more flexible to fit the shapes and orientations of\ntargets, and filter out low-quality samples. Furthermore, a spatial distance\nweighting (SDW) module is introduced to integrate the adaptive distance\nweighting into loss function, which makes the detector more focused on the\nhigh-quality samples. Extensive experiments on several popular datasets\ndemonstrate the effectiveness and superiority of our proposed EARL, where\nwithout bells and whistles, it can be easily applied to different detectors and\nachieve state-of-the-art performance. The source code will be available at:\nhttps://github.com/Justlovesmile/EARL.\n","authors":["Jian Guan","Mingjie Xie","Youtian Lin","Guangjun He","Pengming Feng"],"pdf_url":"https://arxiv.org/pdf/2301.05856v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.16979v2","updated":"2023-10-17T02:06:18Z","published":"2023-06-29T14:33:20Z","title":"Defending Black-box Classifiers by Bayesian Boundary Correction","summary":" Classifiers based on deep neural networks have been recently challenged by\nAdversarial Attack, where the widely existing vulnerability has invoked the\nresearch in defending them from potential threats. Given a vulnerable\nclassifier, existing defense methods are mostly white-box and often require\nre-training the victim under modified loss functions/training regimes. While\nthe model/data/training specifics of the victim are usually unavailable to the\nuser, re-training is unappealing, if not impossible for reasons such as limited\ncomputational resources. To this end, we propose a new black-box defense\nframework. It can turn any pre-trained classifier into a resilient one with\nlittle knowledge of the model specifics. This is achieved by new joint Bayesian\ntreatments on the clean data, the adversarial examples and the classifier, for\nmaximizing their joint probability. It is further equipped with a new\npost-train strategy which keeps the victim intact. We name our framework\nBayesian Boundary Correction (BBC). BBC is a general and flexible framework\nthat can easily adapt to different data types. We instantiate BBC for image\nclassification and skeleton-based human activity recognition, for both static\nand dynamic data. Exhaustive evaluation shows that BBC has superior robustness\nand can enhance robustness without severely hurting the clean accuracy,\ncompared with existing defense methods.\n","authors":["He Wang","Yunfeng Diao"],"pdf_url":"https://arxiv.org/pdf/2306.16979v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2203.04713"},{"id":"http://arxiv.org/abs/2309.04965v2","updated":"2023-10-17T01:30:57Z","published":"2023-09-10T08:55:24Z","title":"Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image\n Captioning","summary":" While impressive performance has been achieved in image captioning, the\nlimited diversity of the generated captions and the large parameter scale\nremain major barriers to the real-word application of these systems. In this\nwork, we propose a lightweight image captioning network in combination with\ncontinuous diffusion, called Prefix-diffusion. 
To achieve diversity, we design\nan efficient method that injects prefix image embeddings into the denoising\nprocess of the diffusion model. In order to reduce trainable parameters, we\nemploy a pre-trained model to extract image features and further design an\nextra mapping network. Prefix-diffusion is able to generate diverse captions\nwith relatively less parameters, while maintaining the fluency and relevance of\nthe captions benefiting from the generative capabilities of the diffusion\nmodel. Our work paves the way for scaling up diffusion models for image\ncaptioning, and achieves promising performance compared with recent approaches.\n","authors":["Guisheng Liu","Yi Li","Zhengcong Fei","Haiyan Fu","Xiangyang Luo","Yanqing Guo"],"pdf_url":"https://arxiv.org/pdf/2309.04965v2.pdf","comment":"11 pages,4 figures, 6 tables"},{"id":"http://arxiv.org/abs/2310.10912v1","updated":"2023-10-17T01:12:08Z","published":"2023-10-17T01:12:08Z","title":"Towards Training-free Open-world Segmentation via Image Prompting\n Foundation Models","summary":" The realm of computer vision has witnessed a paradigm shift with the advent\nof foundational models, mirroring the transformative influence of large\nlanguage models in the domain of natural language processing. This paper delves\ninto the exploration of open-world segmentation, presenting a novel approach\ncalled Image Prompt Segmentation (IPSeg) that harnesses the power of vision\nfoundational models. At the heart of IPSeg lies the principle of a\ntraining-free paradigm, which capitalizes on image prompting techniques. IPSeg\nutilizes a single image containing a subjective visual concept as a flexible\nprompt to query vision foundation models like DINOv2 and Stable Diffusion. Our\napproach extracts robust features for the prompt image and input image, then\nmatches the input representations to the prompt representations via a novel\nfeature interaction module to generate point prompts highlighting target\nobjects in the input image. The generated point prompts are further utilized to\nguide the Segment Anything Model to segment the target object in the input\nimage. The proposed method stands out by eliminating the need for exhaustive\ntraining sessions, thereby offering a more efficient and scalable solution.\nExperiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's\nefficacy for flexible open-world segmentation using intuitive image prompts.\nThis work pioneers tapping foundation models for open-world understanding\nthrough visual concepts conveyed in images.\n","authors":["Lv Tang","Peng-Tao Jiang","Hao-Ke Xiao","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2310.10912v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04780v4","updated":"2023-10-17T00:51:46Z","published":"2023-10-07T11:45:33Z","title":"IPMix: Label-Preserving Data Augmentation Method for Training Robust\n Classifiers","summary":" Data augmentation has been proven effective for training high-accuracy\nconvolutional neural network classifiers by preventing overfitting. However,\nbuilding deep neural networks in real-world scenarios requires not only high\naccuracy on clean data but also robustness when data distributions shift. While\nprior methods have proposed that there is a trade-off between accuracy and\nrobustness, we propose IPMix, a simple data augmentation approach to improve\nrobustness without hurting clean accuracy. 
IPMix integrates three levels of\ndata augmentation (image-level, patch-level, and pixel-level) into a coherent\nand label-preserving technique to increase the diversity of training data with\nlimited computational overhead. To further improve the robustness, IPMix\nintroduces structural complexity at different levels to generate more diverse\nimages and adopts the random mixing method for multi-scale information fusion.\nExperiments demonstrate that IPMix outperforms state-of-the-art corruption\nrobustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also\nsignificantly improves the other safety measures, including robustness to\nadversarial perturbations, calibration, prediction consistency, and anomaly\ndetection, achieving state-of-the-art or comparable results on several\nbenchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.\n","authors":["Zhenglin Huang","Xianan Bao","Na Zhang","Qingqi Zhang","Xiaomei Tu","Biao Wu","Xi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.04780v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.01998v2","updated":"2023-10-17T00:48:36Z","published":"2023-07-05T03:07:00Z","title":"Zero-Shot Neural Architecture Search: Challenges, Solutions, and\n Opportunities","summary":" Recently, zero-shot (or training-free) Neural Architecture Search (NAS)\napproaches have been proposed to liberate NAS from the expensive training\nprocess. The key idea behind zero-shot NAS approaches is to design proxies that\ncan predict the accuracy of some given networks without training the network\nparameters. The proxies proposed so far are usually inspired by recent progress\nin theoretical understanding of deep learning and have shown great potential on\nseveral datasets and NAS benchmarks. This paper aims to comprehensively review\nand compare the state-of-the-art (SOTA) zero-shot NAS approaches, with an\nemphasis on their hardware awareness. To this end, we first review the\nmainstream zero-shot proxies and discuss their theoretical underpinnings. We\nthen compare these zero-shot proxies through large-scale experiments and\ndemonstrate their effectiveness in both hardware-aware and hardware-oblivious\nNAS scenarios. Finally, we point out several promising ideas to design better\nproxies. Our source code and the list of related papers are available on\nhttps://github.com/SLDGroup/survey-zero-shot-nas.\n","authors":["Guihong Li","Duc Hoang","Kartikeya Bhardwaj","Ming Lin","Zhangyang Wang","Radu Marculescu"],"pdf_url":"https://arxiv.org/pdf/2307.01998v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.08586v2","updated":"2023-10-17T00:12:20Z","published":"2023-09-15T17:43:40Z","title":"Replacing softmax with ReLU in Vision Transformers","summary":" Previous research observed accuracy degradation when replacing the attention\nsoftmax with a point-wise activation such as ReLU. In the context of vision\ntransformers, we find that this degradation is mitigated when dividing by\nsequence length. 
Our experiments training small to large vision transformers on\nImageNet-21k indicate that ReLU-attention can approach or match the performance\nof softmax-attention in terms of scaling behavior as a function of compute.\n","authors":["Mitchell Wortsman","Jaehoon Lee","Justin Gilmer","Simon Kornblith"],"pdf_url":"https://arxiv.org/pdf/2309.08586v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13013v2","updated":"2023-10-17T00:11:15Z","published":"2023-04-25T17:38:18Z","title":"Stable and low-precision training for large-scale vision-language models","summary":" We introduce new methods for 1) accelerating and 2) stabilizing training for\nlarge language-vision models. 1) For acceleration, we introduce SwitchBack, a\nlinear layer for int8 quantized training which provides a speed-up of 13-25%\nwhile matching the performance of bfloat16 training within 0.1 percentage\npoints for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date.\nOur main focus is int8 as GPU support for float8 is rare, though we also\nanalyze float8 training through simulation. While SwitchBack proves effective\nfor float8, we show that standard techniques are also successful if the network\nis trained and initialized so that large feature magnitudes are discouraged,\nwhich we accomplish via layer-scale initialized with zeros. 2) For stability,\nwe analyze loss spikes and find they consistently occur 1-8 iterations after\nthe squared gradients become under-estimated by their AdamW second moment\nestimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids\nloss spikes when training a CLIP ViT-Huge model and outperforms gradient\nclipping at the scales we test.\n","authors":["Mitchell Wortsman","Tim Dettmers","Luke Zettlemoyer","Ari Morcos","Ali Farhadi","Ludwig Schmidt"],"pdf_url":"https://arxiv.org/pdf/2304.13013v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.11629v1","updated":"2023-10-17T23:37:23Z","published":"2023-10-17T23:37:23Z","title":"Holistic Parking Slot Detection with Polygon-Shaped Representations","summary":" Current parking slot detection in advanced driver-assistance systems (ADAS)\nprimarily relies on ultrasonic sensors. This method has several limitations\nsuch as the need to scan the entire parking slot before detecting it, the\nincapacity of detecting multiple slots in a row, and the difficulty of\nclassifying them. Due to the complex visual environment, vehicles are equipped\nwith surround view camera systems to detect vacant parking slots. Previous\nresearch works in this field mostly use image-domain models to solve the\nproblem. These two-stage approaches separate the 2D detection and 3D pose\nestimation steps using camera calibration. In this paper, we propose one-step\nHolistic Parking Slot Network (HPS-Net), a tailor-made adaptation of the You\nOnly Look Once (YOLO)v4 algorithm. This camera-based approach directly outputs\nthe four vertex coordinates of the parking slot in topview domain, instead of a\nbounding box in raw camera images. Several visible points and shapes can be\nproposed from different angles. A novel regression loss function named\npolygon-corner Generalized Intersection over Union (GIoU) for polygon vertex\nposition optimization is also proposed to manage the slot orientation and to\ndistinguish the entrance line. Experiments show that HPS-Net can detect various\nvacant parking slots with a F1-score of 0.92 on our internal Valeo Parking\nSlots Dataset (VPSD) and 0.99 on the public dataset PS2.0. 
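The observation in the ReLU-attention summary above, that replacing the attention softmax with a point-wise ReLU works once the scores are divided by sequence length, can be written out directly; the NumPy sketch below contrasts the two attention forms on toy data, with the placement of the 1/L factor following common practice rather than a detail confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 32                                  # sequence length and head dimension
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))

scores = q @ k.T / np.sqrt(d)

# Standard softmax attention.
w_soft = np.exp(scores - scores.max(axis=-1, keepdims=True))
w_soft /= w_soft.sum(axis=-1, keepdims=True)
out_soft = w_soft @ v

# ReLU attention with the 1/L scaling discussed in the summary above.
w_relu = np.maximum(scores, 0.0) / L
out_relu = w_relu @ v

print(out_soft.shape, out_relu.shape)          # both (16, 32)
```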
It provides a\nsatisfying generalization and robustness in various parking scenarios, such as\nindoor (F1: 0.86) or paved ground (F1: 0.91). Moreover, it achieves a real-time\ndetection speed of 17 FPS on Nvidia Drive AGX Xavier. A demo video can be found\nat https://streamable.com/75j7sj.\n","authors":["Lihao Wang","Antonyo Musabini","Christel Leonet","Rachid Benmokhtar","Amaury Breheret","Chaima Yedes","Fabian Burger","Thomas Boulay","Xavier Perrotton"],"pdf_url":"https://arxiv.org/pdf/2310.11629v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11622v1","updated":"2023-10-17T23:20:36Z","published":"2023-10-17T23:20:36Z","title":"High-Resolution Building and Road Detection from Sentinel-2","summary":" Mapping buildings and roads automatically with remote sensing typically\nrequires high-resolution imagery, which is expensive to obtain and often\nsparsely available. In this work we demonstrate how multiple 10 m resolution\nSentinel-2 images can be used to generate 50 cm resolution building and road\nsegmentation masks. This is done by training a `student' model with access to\nSentinel-2 images to reproduce the predictions of a `teacher' model which has\naccess to corresponding high-resolution imagery. While the predictions do not\nhave all the fine detail of the teacher model, we find that we are able to\nretain much of the performance: for building segmentation we achieve 78.3%\nmIoU, compared to the high-resolution teacher model accuracy of 85.3% mIoU. We\nalso describe a related method for counting individual buildings in a\nSentinel-2 patch which achieves R^2 = 0.91 against true counts. This work opens\nup new possibilities for using freely available Sentinel-2 imagery for a range\nof tasks that previously could only be done with high-resolution satellite\nimagery.\n","authors":["Wojciech Sirko","Emmanuel Asiedu Brempong","Juliana T. C. Marcos","Abigail Annkah","Abel Korme","Mohammed Alewi Hassen","Krishna Sapkota","Tomer Shekel","Abdoulaye Diack","Sella Nevo","Jason Hickey","John Quinn"],"pdf_url":"https://arxiv.org/pdf/2310.11622v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09199v2","updated":"2023-10-17T22:38:51Z","published":"2023-10-13T15:45:19Z","title":"PaLI-3 Vision Language Models: Smaller, Faster, Stronger","summary":" This paper presents PaLI-3, a smaller, faster, and stronger vision language\nmodel (VLM) that compares favorably to similar models that are 10x larger. As\npart of arriving at this strong performance, we compare Vision Transformer\n(ViT) models pretrained using classification objectives to contrastively\n(SigLIP) pretrained ones. We find that, while slightly underperforming on\nstandard image classification benchmarks, SigLIP-based PaLI shows superior\nperformance across various multimodal benchmarks, especially on localization\nand visually-situated text understanding. We scale the SigLIP image encoder up\nto 2 billion parameters, and achieves a new state-of-the-art on multilingual\ncross-modal retrieval. 
We hope that PaLI-3, at only 5B parameters, rekindles\nresearch on fundamental pieces of complex VLMs, and could fuel a new generation\nof scaled-up models.\n","authors":["Xi Chen","Xiao Wang","Lucas Beyer","Alexander Kolesnikov","Jialin Wu","Paul Voigtlaender","Basil Mustafa","Sebastian Goodman","Ibrahim Alabdulmohsin","Piotr Padlewski","Daniel Salz","Xi Xiong","Daniel Vlasic","Filip Pavetic","Keran Rong","Tianli Yu","Daniel Keysers","Xiaohua Zhai","Radu Soricut"],"pdf_url":"https://arxiv.org/pdf/2310.09199v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.09616v8","updated":"2023-10-17T22:06:33Z","published":"2022-09-19T09:16:07Z","title":"Provably Uncertainty-Guided Universal Domain Adaptation","summary":" Universal domain adaptation (UniDA) aims to transfer the knowledge from a\nlabeled source domain to an unlabeled target domain without any assumptions of\nthe label sets, which requires distinguishing the unknown samples from the\nknown ones in the target domain. A main challenge of UniDA is that the\nnonidentical label sets cause the misalignment between the two domains.\nMoreover, the domain discrepancy and the supervised objectives in the source\ndomain easily lead the whole model to be biased towards the common classes and\nproduce overconfident predictions for unknown samples. To address the above\nchallenging problems, we propose a new uncertainty-guided UniDA framework.\nFirstly, we introduce an empirical estimation of the probability of a target\nsample belonging to the unknown class which fully exploits the distribution of\nthe target samples in the latent space. Then, based on the estimation, we\npropose a novel neighbors searching scheme in a linear subspace with a\n$\\delta$-filter to estimate the uncertainty score of a target sample and\ndiscover unknown samples. It fully utilizes the relationship between a target\nsample and its neighbors in the source domain to avoid the influence of domain\nmisalignment. Secondly, this paper well balances the confidences of predictions\nfor both known and unknown samples through an uncertainty-guided margin loss\nbased on the confidences of discovered unknown samples, which can reduce the\ngap between the intra-class variances of known classes with respect to the\nunknown class. Finally, experiments on three public datasets demonstrate that\nour method significantly outperforms existing state-of-the-art methods.\n","authors":["Yifan Wang","Lin Zhang","Ran Song","Paul L. Rosin","Yibin Li","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2209.09616v8.pdf","comment":"13 pages. arXiv admin note: text overlap with arXiv:2207.09280"},{"id":"http://arxiv.org/abs/2310.11608v1","updated":"2023-10-17T22:04:42Z","published":"2023-10-17T22:04:42Z","title":"Classification of Safety Driver Attention During Autonomous Vehicle\n Operation","summary":" Despite the continual advances in Advanced Driver Assistance Systems (ADAS)\nand the development of high-level autonomous vehicles (AV), there is a general\nconsensus that for the short to medium term, there is a requirement for a human\nsupervisor to handle the edge cases that inevitably arise. Given this\nrequirement, it is essential that the state of the vehicle operator is\nmonitored to ensure they are contributing to the vehicle's safe operation. This\npaper introduces a dual-source approach integrating data from an infrared\ncamera facing the vehicle operator and vehicle perception systems to produce a\nmetric for driver alertness in order to promote and ensure safe operator\nbehaviour. 
The infrared camera detects the driver's head, enabling the\ncalculation of head orientation, which is relevant as the head typically moves\naccording to the individual's focus of attention. By incorporating\nenvironmental data from the perception system, it becomes possible to determine\nwhether the vehicle operator observes objects in the surroundings. Experiments\nwere conducted using data collected in Sydney, Australia, simulating AV\noperations in an urban environment. Our results demonstrate that the proposed\nsystem effectively determines a metric for the attention levels of the vehicle\noperator, enabling interventions such as warnings or reducing autonomous\nfunctionality as appropriate. This comprehensive solution shows promise in\ncontributing to ADAS and AVs' overall safety and efficiency in a real-world\nsetting.\n","authors":["Santiago Gerling Konrad","Julie Stephany Berrio","Mao Shan","Favio Masson","Stewart Worrall"],"pdf_url":"https://arxiv.org/pdf/2310.11608v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.09473v2","updated":"2023-10-17T22:01:00Z","published":"2023-02-19T04:03:22Z","title":"Video-Text Retrieval by Supervised Sparse Multi-Grained Learning","summary":" While recent progress in video-text retrieval has been advanced by the\nexploration of better representation learning, in this paper, we present a\nnovel multi-grained sparse learning framework, S3MA, to learn an aligned sparse\nspace shared between the video and the text for video-text retrieval. The\nshared sparse space is initialized with a finite number of sparse concepts,\neach of which refers to a number of words. With the text data at hand, we learn\nand update the shared sparse space in a supervised manner using the proposed\nsimilarity and alignment losses. Moreover, to enable multi-grained alignment,\nwe incorporate frame representations for better modeling the video modality and\ncalculating fine-grained and coarse-grained similarities. Benefiting from the\nlearned shared sparse space and multi-grained similarities, extensive\nexperiments on several video-text retrieval benchmarks demonstrate the\nsuperiority of S3MA over existing methods. Our code is available at\nhttps://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.\n","authors":["Yimu Wang","Peng Shi"],"pdf_url":"https://arxiv.org/pdf/2302.09473v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11605v1","updated":"2023-10-17T21:59:45Z","published":"2023-10-17T21:59:45Z","title":"DIAR: Deep Image Alignment and Reconstruction using Swin Transformers","summary":" When taking images of some occluded content, one is often faced with the\nproblem that every individual image frame contains unwanted artifacts, but a\ncollection of images contains all relevant information if properly aligned and\naggregated. In this paper, we attempt to build a deep learning pipeline that\nsimultaneously aligns a sequence of distorted images and reconstructs them. We\ncreate a dataset that contains images with image distortions, such as lighting,\nspecularities, shadows, and occlusion. We create perspective distortions with\ncorresponding ground-truth homographies as labels. We use our dataset to train\nSwin transformer models to analyze sequential image data. The attention maps\nenable the model to detect relevant image content and differentiate it from\noutliers and artifacts. We further explore using neural feature maps as\nalternatives to classical key point detectors. 
The feature maps of trained\nconvolutional layers provide dense image descriptors that can be used to find\npoint correspondences between images. We utilize this to compute coarse image\nalignments and explore its limitations.\n","authors":["Monika Kwiatkowski","Simon Matern","Olaf Hellwich"],"pdf_url":"https://arxiv.org/pdf/2310.11605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11598v1","updated":"2023-10-17T21:45:51Z","published":"2023-10-17T21:45:51Z","title":"Learning Neural Implicit through Volume Rendering with Attentive Depth\n Fusion Priors","summary":" Learning neural implicit representations has achieved remarkable performance\nin 3D reconstruction from multi-view images. Current methods use volume\nrendering to render implicit representations into either RGB or depth images\nthat are supervised by multi-view ground truth. However, rendering a view each\ntime suffers from incomplete depth at holes and unawareness of occluded\nstructures from the depth supervision, which severely affects the accuracy of\ngeometry inference via volume rendering. To resolve this issue, we propose to\nlearn neural implicit representations from multi-view RGBD images through\nvolume rendering with an attentive depth fusion prior. Our prior allows neural\nnetworks to perceive coarse 3D structures from the Truncated Signed Distance\nFunction (TSDF) fused from all depth images available for rendering. The TSDF\nenables accessing the missing depth at holes on one depth image and the\noccluded parts that are invisible from the current view. By introducing a novel\nattention mechanism, we allow neural networks to directly use the depth fusion\nprior with the inferred occupancy as the learned implicit function. Our\nattention mechanism works with either a one-time fused TSDF that represents a\nwhole scene or an incrementally fused TSDF that represents a partial scene in\nthe context of Simultaneous Localization and Mapping (SLAM). Our evaluations on\nwidely used benchmarks including synthetic and real-world scans show our\nsuperiority over the latest neural implicit methods. Project page:\nhttps://machineperceptionlab.github.io/Attentive_DF_Prior/\n","authors":["Pengchong Hu","Zhizhong Han"],"pdf_url":"https://arxiv.org/pdf/2310.11598v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.11595v1","updated":"2023-10-17T21:43:42Z","published":"2023-10-17T21:43:42Z","title":"WaveAttack: Asymmetric Frequency Obfuscation-based Backdoor Attacks\n Against Deep Neural Networks","summary":" Due to the popularity of Artificial Intelligence (AI) technology, numerous\nbackdoor attacks are designed by adversaries to mislead deep neural network\npredictions by manipulating training samples and training processes. Although\nbackdoor attacks are effective in various real scenarios, they still suffer\nfrom the problems of both low fidelity of poisoned samples and non-negligible\ntransfer in latent space, which make them easily detectable by existing\nbackdoor detection algorithms. To overcome the weakness, this paper proposes a\nnovel frequency-based backdoor attack method named WaveAttack, which obtains\nimage high-frequency features through Discrete Wavelet Transform (DWT) to\ngenerate backdoor triggers. Furthermore, we introduce an asymmetric frequency\nobfuscation method, which can add an adaptive residual in the training and\ninference stage to improve the impact of triggers and further enhance the\neffectiveness of WaveAttack. 
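For the frequency decomposition underlying WaveAttack described above, a single-level 2D discrete wavelet transform separates an image into a low-frequency approximation and high-frequency detail subbands; the sketch below uses the third-party PyWavelets package as one possible implementation on a toy image, and does not reproduce the paper's trigger-generation or asymmetric obfuscation steps.

```python
import numpy as np
import pywt  # PyWavelets, assumed here as one convenient DWT implementation

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32)).astype(np.float32)  # toy grayscale image

# One level of 2D DWT: cA is the low-frequency approximation, while
# (cH, cV, cD) are the high-frequency detail subbands, i.e. the part of the
# image a WaveAttack-style method would work with to keep changes inconspicuous.
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
print(cA.shape, cH.shape, cV.shape, cD.shape)  # each (16, 16) for a 32x32 input

# The transform is invertible, so edited subbands map back to image space.
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), "haar")
print(float(np.abs(reconstructed - image).max()))  # close to 0
```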
Comprehensive experimental results show that\nWaveAttack not only achieves higher stealthiness and effectiveness, but also\noutperforms state-of-the-art (SOTA) backdoor attack methods in the fidelity of\nimages by up to 28.27\\% improvement in PSNR, 1.61\\% improvement in SSIM, and\n70.59\\% reduction in IS. Our code is available at\nhttps://anonymous.4open.science/r/AnonymousRep-701D.\n","authors":["Jun Xia","Zhihao Yue","Yingbo Zhou","Zhiwei Ling","Xian Wei","Mingsong Chen"],"pdf_url":"https://arxiv.org/pdf/2310.11595v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11577v1","updated":"2023-10-17T20:55:53Z","published":"2023-10-17T20:55:53Z","title":"Studying the Effects of Sex-related Differences on Brain Age Prediction\n using brain MR Imaging","summary":" While utilizing machine learning models, one of the most crucial aspects is\nhow bias and fairness affect model outcomes for diverse demographics. This\nbecomes especially relevant in the context of machine learning for medical\nimaging applications as these models are increasingly being used for diagnosis\nand treatment planning. In this paper, we study biases related to sex when\ndeveloping a machine learning model based on brain magnetic resonance images\n(MRI). We investigate the effects of sex by performing brain age prediction\nconsidering different experimental designs: model trained using only female\nsubjects, only male subjects and a balanced dataset. We also perform evaluation\non multiple MRI datasets (Calgary-Campinas(CC359) and CamCAN) to assess the\ngeneralization capability of the proposed models. We found disparities in the\nperformance of brain age prediction models when trained on distinct sex\nsubgroups and datasets, in both final predictions and decision making (assessed\nusing interpretability models). Our results demonstrated variations in model\ngeneralizability across sex-specific subgroups, suggesting potential biases in\nmodels trained on unbalanced datasets. This underlines the critical role of\ncareful experimental design in generating fair and reliable outcomes.\n","authors":["Mahsa Dibaji","Neha Gianchandani","Akhil Nair","Mansi Singhal","Roberto Souza","Mariana Bento"],"pdf_url":"https://arxiv.org/pdf/2310.11577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.12237v2","updated":"2023-10-17T20:50:05Z","published":"2023-03-21T23:44:02Z","title":"Automated deep learning segmentation of high-resolution 7 T postmortem\n MRI for quantitative analysis of structure-pathology correlations in\n neurodegenerative diseases","summary":" Postmortem MRI allows brain anatomy to be examined at high resolution and to\nlink pathology measures with morphometric measurements. However, automated\nsegmentation methods for brain mapping in postmortem MRI are not well\ndeveloped, primarily due to limited availability of labeled datasets, and\nheterogeneity in scanner hardware and acquisition protocols. In this work, we\npresent a high resolution of 135 postmortem human brain tissue specimens imaged\nat 0.3 mm$^{3}$ isotropic using a T2w sequence on a 7T whole-body MRI scanner.\nWe developed a deep learning pipeline to segment the cortical mantle by\nbenchmarking the performance of nine deep neural architectures, followed by\npost-hoc topological correction. We then segment four subcortical structures\n(caudate, putamen, globus pallidus, and thalamus), white matter\nhyperintensities, and the normal appearing white matter. 
We show generalizing\ncapabilities across whole brain hemispheres in different specimens, and also on\nunseen images acquired at 0.28 mm^3 and 0.16 mm^3 isotropic T2*w FLASH sequence\nat 7T. We then compute localized cortical thickness and volumetric measurements\nacross key regions, and link them with semi-quantitative neuropathological\nratings. Our code, Jupyter notebooks, and the containerized executables are\npublicly available at: https://pulkit-khandelwal.github.io/exvivo-brain-upenn\n","authors":["Pulkit Khandelwal","Michael Tran Duong","Shokufeh Sadaghiani","Sydney Lim","Amanda Denning","Eunice Chung","Sadhana Ravikumar","Sanaz Arezoumandan","Claire Peterson","Madigan Bedard","Noah Capp","Ranjit Ittyerah","Elyse Migdal","Grace Choi","Emily Kopp","Bridget Loja","Eusha Hasan","Jiacheng Li","Alejandra Bahena","Karthik Prabhakaran","Gabor Mizsei","Marianna Gabrielyan","Theresa Schuck","Winifred Trotman","John Robinson","Daniel Ohm","Edward B. Lee","John Q. Trojanowski","Corey McMillan","Murray Grossman","David J. Irwin","John Detre","M. Dylan Tisdall","Sandhitsu R. Das","Laura E. M. Wisse","David A. Wolk","Paul A. Yushkevich"],"pdf_url":"https://arxiv.org/pdf/2303.12237v2.pdf","comment":"Preprint submitted to NeuroImage Project website:\n https://pulkit-khandelwal.github.io/exvivo-brain-upenn"},{"id":"http://arxiv.org/abs/2310.11535v1","updated":"2023-10-17T19:10:45Z","published":"2023-10-17T19:10:45Z","title":"Learning Lens Blur Fields","summary":" Optical blur is an inherent property of any lens system and is challenging to\nmodel in modern cameras because of their complex optical elements. To tackle\nthis challenge, we introduce a high-dimensional neural representation of\nblur$-$$\\textit{the lens blur field}$$-$and a practical method for acquiring\nit. The lens blur field is a multilayer perceptron (MLP) designed to (1)\naccurately capture variations of the lens 2D point spread function over image\nplane location, focus setting and, optionally, depth and (2) represent these\nvariations parametrically as a single, sensor-specific function. The\nrepresentation models the combined effects of defocus, diffraction, aberration,\nand accounts for sensor features such as pixel color filters and pixel-specific\nmicro-lenses. To learn the real-world blur field of a given device, we\nformulate a generalized non-blind deconvolution problem that directly optimizes\nthe MLP weights using a small set of focal stacks as the only input. We also\nprovide a first-of-its-kind dataset of 5D blur fields$-$for smartphone cameras,\ncamera bodies equipped with a variety of lenses, etc. Lastly, we show that\nacquired 5D blur fields are expressive and accurate enough to reveal, for the\nfirst time, differences in optical behavior of smartphone devices of the same\nmake and model.\n","authors":["Esther Y. H. Lin","Zhecheng Wang","Rebecca Lin","Daniel Miau","Florian Kainz","Jiawen Chen","Xuaner Cecilia Zhang","David B. Lindell","Kiriakos N. Kutulakos"],"pdf_url":"https://arxiv.org/pdf/2310.11535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15084v2","updated":"2023-10-17T18:54:20Z","published":"2023-09-26T17:27:22Z","title":"The Surveillance AI Pipeline","summary":" A rapidly growing number of voices argue that AI research, and computer\nvision in particular, is powering mass surveillance. Yet the direct path from\ncomputer vision research to surveillance has remained obscured and difficult to\nassess. 
Here, we reveal the Surveillance AI pipeline by analyzing three decades\nof computer vision research papers and downstream patents, more than 40,000\ndocuments. We find the large majority of annotated computer vision papers and\npatents self-report their technology enables extracting data about humans.\nMoreover, the majority of these technologies specifically enable extracting\ndata about human bodies and body parts. We present both quantitative and rich\nqualitative analysis illuminating these practices of human data extraction.\nStudying the roots of this pipeline, we find that institutions that\nprolifically produce computer vision research, namely elite universities and\n\"big tech\" corporations, are subsequently cited in thousands of surveillance\npatents. Further, we find consistent evidence against the narrative that only\nthese few rogue entities are contributing to surveillance. Rather, we expose\nthe fieldwide norm that when an institution, nation, or subfield authors\ncomputer vision papers with downstream patents, the majority of these papers\nare used in surveillance patents. In total, we find the number of papers with\ndownstream surveillance patents increased more than five-fold between the 1990s\nand the 2010s, with computer vision research now having been used in more than\n11,000 surveillance patents. Finally, in addition to the high levels of\nsurveillance we find documented in computer vision papers and patents, we\nunearth pervasive patterns of documents using language that obfuscates the\nextent of surveillance. Our analysis reveals the pipeline by which computer\nvision research has powered the ongoing expansion of surveillance.\n","authors":["Pratyusha Ria Kalluri","William Agnew","Myra Cheng","Kentrell Owens","Luca Soldaini","Abeba Birhane"],"pdf_url":"https://arxiv.org/pdf/2309.15084v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11513v1","updated":"2023-10-17T18:20:03Z","published":"2023-10-17T18:20:03Z","title":"GenEval: An Object-Focused Framework for Evaluating Text-to-Image\n Alignment","summary":" Recent breakthroughs in diffusion models, multimodal pretraining, and\nefficient finetuning have led to an explosion of text-to-image generative\nmodels. Given human evaluation is expensive and difficult to scale, automated\nmethods are critical for evaluating the increasingly large number of new\nmodels. However, most current automated evaluation metrics like FID or\nCLIPScore only offer a holistic measure of image quality or image-text\nalignment, and are unsuited for fine-grained or instance-level analysis. In\nthis paper, we introduce GenEval, an object-focused framework to evaluate\ncompositional image properties such as object co-occurrence, position, count,\nand color. We show that current object detection models can be leveraged to\nevaluate text-to-image models on a variety of generation tasks with strong\nhuman agreement, and that other discriminative vision models can be linked to\nthis pipeline to further verify properties like object color. We then evaluate\nseveral open-source text-to-image models and analyze their relative generative\ncapabilities on our benchmark. We find that recent models demonstrate\nsignificant improvement on these tasks, though they are still lacking in\ncomplex capabilities such as spatial relations and attribute binding. Finally,\nwe demonstrate how GenEval might be used to help discover existing failure\nmodes, in order to inform development of the next generation of text-to-image\nmodels. 
Our code to run the GenEval framework is publicly available at\nhttps://github.com/djghosh13/geneval.\n","authors":["Dhruba Ghosh","Hanna Hajishirzi","Ludwig Schmidt"],"pdf_url":"https://arxiv.org/pdf/2310.11513v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09978v2","updated":"2023-10-17T18:15:15Z","published":"2023-10-15T23:05:17Z","title":"Chinese Painting Style Transfer Using Deep Generative Models","summary":" Artistic style transfer aims to modify the style of the image while\npreserving its content. Style transfer using deep learning models has been\nwidely studied since 2015, and most of the applications are focused on specific\nartists like Van Gogh, Monet, Cezanne. There are few researches and\napplications on traditional Chinese painting style transfer. In this paper, we\nwill study and leverage different state-of-the-art deep generative models for\nChinese painting style transfer and evaluate the performance both qualitatively\nand quantitatively. In addition, we propose our own algorithm that combines\nseveral style transfer models for our task. Specifically, we will transfer two\nmain types of traditional Chinese painting style, known as \"Gong-bi\" and\n\"Shui-mo\" (to modern images like nature objects, portraits and landscapes.\n","authors":["Weijian Ma","Yanyang Kong"],"pdf_url":"https://arxiv.org/pdf/2310.09978v2.pdf","comment":"Paper is too old (written in 2019)"},{"id":"http://arxiv.org/abs/2304.10532v3","updated":"2023-10-17T18:15:06Z","published":"2023-04-20T17:59:05Z","title":"Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs","summary":" Casually captured Neural Radiance Fields (NeRFs) suffer from artifacts such\nas floaters or flawed geometry when rendered outside the camera trajectory.\nExisting evaluation protocols often do not capture these effects, since they\nusually only assess image quality at every 8th frame of the training capture.\nTo push forward progress in novel-view synthesis, we propose a new dataset and\nevaluation procedure, where two camera trajectories are recorded of the scene:\none used for training, and the other for evaluation. In this more challenging\nin-the-wild setting, we find that existing hand-crafted regularizers do not\nremove floaters nor improve scene geometry. Thus, we propose a 3D\ndiffusion-based method that leverages local 3D priors and a novel density-based\nscore distillation sampling loss to discourage artifacts during NeRF\noptimization. We show that this data-driven prior removes floaters and improves\nscene geometry for casual captures.\n","authors":["Frederik Warburg","Ethan Weber","Matthew Tancik","Aleksander Holynski","Angjoo Kanazawa"],"pdf_url":"https://arxiv.org/pdf/2304.10532v3.pdf","comment":"ICCV 2023, project page: https://ethanweber.me/nerfbusters"},{"id":"http://arxiv.org/abs/2310.00530v2","updated":"2023-10-17T18:14:02Z","published":"2023-10-01T00:21:01Z","title":"Enabling Neural Radiance Fields (NeRF) for Large-scale Aerial Images --\n A Multi-tiling Approach and the Geometry Assessment of NeRF","summary":" Neural Radiance Fields (NeRF) offer the potential to benefit 3D\nreconstruction tasks, including aerial photogrammetry. However, the scalability\nand accuracy of the inferred geometry are not well-documented for large-scale\naerial assets,since such datasets usually result in very high memory\nconsumption and slow convergence.. 
In this paper, we aim to scale the NeRF on\nlarge-scael aerial datasets and provide a thorough geometry assessment of NeRF.\nSpecifically, we introduce a location-specific sampling technique as well as a\nmulti-camera tiling (MCT) strategy to reduce memory consumption during image\nloading for RAM, representation training for GPU memory, and increase the\nconvergence rate within tiles. MCT decomposes a large-frame image into multiple\ntiled images with different camera models, allowing these small-frame images to\nbe fed into the training process as needed for specific locations without a\nloss of accuracy. We implement our method on a representative approach,\nMip-NeRF, and compare its geometry performance with threephotgrammetric MVS\npipelines on two typical aerial datasets against LiDAR reference data. Both\nqualitative and quantitative results suggest that the proposed NeRF approach\nproduces better completeness and object details than traditional approaches,\nalthough as of now, it still falls short in terms of accuracy.\n","authors":["Ningli Xu","Rongjun Qin","Debao Huang","Fabio Remondino"],"pdf_url":"https://arxiv.org/pdf/2310.00530v2.pdf","comment":"9 Figure"},{"id":"http://arxiv.org/abs/2308.11063v2","updated":"2023-10-17T18:13:48Z","published":"2023-08-21T22:16:49Z","title":"MetaGCD: Learning to Continually Learn in Generalized Category Discovery","summary":" In this paper, we consider a real-world scenario where a model that is\ntrained on pre-defined classes continually encounters unlabeled data that\ncontains both known and novel classes. The goal is to continually discover\nnovel classes while maintaining the performance in known classes. We name the\nsetting Continual Generalized Category Discovery (C-GCD). Existing methods for\nnovel class discovery cannot directly handle the C-GCD setting due to some\nunrealistic assumptions, such as the unlabeled data only containing novel\nclasses. Furthermore, they fail to discover novel classes in a continual\nfashion. In this work, we lift all these assumptions and propose an approach,\ncalled MetaGCD, to learn how to incrementally discover with less forgetting.\nOur proposed method uses a meta-learning framework and leverages the offline\nlabeled data to simulate the testing incremental learning process. A\nmeta-objective is defined to revolve around two conflicting learning objectives\nto achieve novel class discovery without forgetting. Furthermore, a soft\nneighborhood-based contrastive network is proposed to discriminate uncorrelated\nimages while attracting correlated images. We build strong baselines and\nconduct extensive experiments on three widely used benchmarks to demonstrate\nthe superiority of our method.\n","authors":["Yanan Wu","Zhixiang Chi","Yang Wang","Songhe Feng"],"pdf_url":"https://arxiv.org/pdf/2308.11063v2.pdf","comment":"This paper has been accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2309.07891v3","updated":"2023-10-17T18:11:28Z","published":"2023-09-14T17:42:08Z","title":"HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a\n Single RGB Image","summary":" This paper presents a method to learn hand-object interaction prior for\nreconstructing a 3D hand-object scene from a single RGB image. The inference as\nwell as training-data generation for 3D hand-object scene reconstruction is\nchallenging due to the depth ambiguity of a single image and occlusions by the\nhand and object. 
We turn this challenge into an opportunity by utilizing the\nhand shape to constrain the possible relative configuration of the hand and\nobject geometry. We design a generalizable implicit function, HandNeRF, that\nexplicitly encodes the correlation of the 3D hand shape features and 2D object\nfeatures to predict the hand and object scene geometry. With experiments on\nreal-world datasets, we show that HandNeRF is able to reconstruct hand-object\nscenes of novel grasp configurations more accurately than comparable methods.\nMoreover, we demonstrate that object reconstruction from HandNeRF ensures more\naccurate execution of downstream tasks, such as grasping and motion planning\nfor robotic hand-over and manipulation. The code will be release here:\nhttps://github.com/SamsungLabs/HandNeRF\n","authors":["Hongsuk Choi","Nikhil Chavan-Dafle","Jiacheng Yuan","Volkan Isler","Hyunsoo Park"],"pdf_url":"https://arxiv.org/pdf/2309.07891v3.pdf","comment":"12 pages including the supplementary material, 8 tables, 12 figures"},{"id":"http://arxiv.org/abs/2310.11482v1","updated":"2023-10-17T13:06:39Z","published":"2023-10-17T13:06:39Z","title":"Rethinking Class-incremental Learning in the Era of Large Pre-trained\n Models via Test-Time Adaptation","summary":" Class-incremental learning (CIL) is a challenging task that involves\ncontinually learning to categorize classes into new tasks without forgetting\npreviously learned information. The advent of the large pre-trained models\n(PTMs) has fast-tracked the progress in CIL due to the highly transferable PTM\nrepresentations, where tuning a small set of parameters results in\nstate-of-the-art performance when compared with the traditional CIL methods\nthat are trained from scratch. However, repeated fine-tuning on each task\ndestroys the rich representations of the PTMs and further leads to forgetting\nprevious tasks. To strike a balance between the stability and plasticity of\nPTMs for CIL, we propose a novel perspective of eliminating training on every\nnew task and instead performing test-time adaptation (TTA) directly on the test\ninstances. Concretely, we propose \"Test-Time Adaptation for Class-Incremental\nLearning\" (TTACIL) that first fine-tunes Layer Norm parameters of the PTM on\neach test instance for learning task-specific features, and then resets them\nback to the base model to preserve stability. As a consequence, TTACIL does not\nundergo any forgetting, while benefiting each task with the rich PTM features.\nAdditionally, by design, our method is robust to common data corruptions. Our\nTTACIL outperforms several state-of-the-art CIL methods when evaluated on\nmultiple CIL benchmarks under both clean and corrupted data.\n","authors":["Imad Eddine Marouf","Subhankar Roy","Enzo Tartaglione","Stéphane Lathuilière"],"pdf_url":"https://arxiv.org/pdf/2310.11482v1.pdf","comment":"8 pages,5 figures"},{"id":"http://arxiv.org/abs/2310.11480v1","updated":"2023-10-17T12:33:43Z","published":"2023-10-17T12:33:43Z","title":"Whole-brain radiomics for clustered federated personalization in brain\n tumor segmentation","summary":" Federated learning and its application to medical image segmentation have\nrecently become a popular research topic. This training paradigm suffers from\nstatistical heterogeneity between participating institutions' local datasets,\nincurring convergence slowdown as well as potential accuracy loss compared to\nclassical training. 
To mitigate this effect, federated personalization emerged\nas the federated optimization of one model per institution. We propose a novel\npersonalization algorithm tailored to the feature shift induced by the usage of\ndifferent scanners and acquisition parameters by different institutions. This\nmethod is the first to account for both inter and intra-institution feature\nshift (multiple scanners used in a single institution). It is based on the\ncomputation, within each centre, of a series of radiomic features capturing the\nglobal texture of each 3D image volume, followed by a clustering analysis\npooling all feature vectors transferred from the local institutions to the\ncentral server. Each computed clustered decentralized dataset (potentially\nincluding data from different institutions) then serves to finetune a global\nmodel obtained through classical federated learning. We validate our approach\non the Federated Brain Tumor Segmentation 2022 Challenge dataset (FeTS2022).\nOur code is available at (https://github.com/MatthisManthe/radiomics_CFFL).\n","authors":["Matthis Manthe","Stefan Duffner","Carole Lartizien"],"pdf_url":"https://arxiv.org/pdf/2310.11480v1.pdf","comment":"Accepted at Medical Imaging with Deep Learning (MiDL) 2023 conference"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.11405v1","updated":"2023-10-17T17:13:13Z","published":"2023-10-17T17:13:13Z","title":"On Coherence-based Predictors for Dense Query Performance Prediction","summary":" Query Performance Prediction (QPP) estimates the effectiveness of a search\nengine's results in response to a query without relevance judgments.\nTraditionally, post-retrieval predictors have focused upon either the\ndistribution of the retrieval scores, or the coherence of the top-ranked\ndocuments using traditional bag-of-words index representations. More recently,\nBERT-based models using dense embedded document representations have been used\nto create new predictors, but mostly applied to predict the performance of\nrankings created by BM25. Instead, we aim to predict the effectiveness of\nrankings created by single-representation dense retrieval models (ANCE &\nTCT-ColBERT). Therefore, we propose a number of variants of existing\nunsupervised coherence-based predictors that employ neural embedding\nrepresentations. In our experiments on the TREC Deep Learning Track datasets,\nwe demonstrate improved accuracy upon dense retrieval (up to 92% compared to\nsparse variants for TCT-ColBERT and 188% for ANCE). Going deeper, we select the\nmost representative and best performing predictors to study the importance of\ndifferences among predictors and query types on query performance. Using\nexisting distribution-based evaluation QPP measures and a particular type of\nlinear mixed models, we find that query types further significantly influence\nquery performance (and are up to 35% responsible for the unstable performance\nof QPP predictors), and that this sensitivity is unique to dense retrieval\nmodels. 
Our approach introduces a new setting for obtaining richer information\non query differences in dense QPP that can explain potential unstable\nperformance of existing predictors and outlines the unique characteristics of\ndifferent query types on dense retrieval models.\n","authors":["Maria Vlachou","Craig Macdonald"],"pdf_url":"https://arxiv.org/pdf/2310.11405v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.01815v6","updated":"2023-10-17T17:06:34Z","published":"2022-04-04T19:42:46Z","title":"Tensor Completion with Provable Consistency and Fairness Guarantees for\n Recommender Systems","summary":" We introduce a new consistency-based approach for defining and solving\nnonnegative/positive matrix and tensor completion problems. The novelty of the\nframework is that instead of artificially making the problem well-posed in the\nform of an application-arbitrary optimization problem, e.g., minimizing a bulk\nstructural measure such as rank or norm, we show that a single\nproperty/constraint: preserving unit-scale consistency, guarantees the\nexistence of both a solution and, under relatively weak support assumptions,\nuniqueness. The framework and solution algorithms also generalize directly to\ntensors of arbitrary dimensions while maintaining computational complexity that\nis linear in problem size for fixed dimension d. In the context of recommender\nsystem (RS) applications, we prove that two reasonable properties that should\nbe expected to hold for any solution to the RS problem are sufficient to permit\nuniqueness guarantees to be established within our framework. This is\nremarkable because it obviates the need for heuristic-based statistical or AI\nmethods despite what appear to be distinctly human/subjective variables at the\nheart of the problem. Key theoretical contributions include a general\nunit-consistent tensor-completion framework with proofs of its properties,\ne.g., consensus-order and fairness, and algorithms with optimal runtime and\nspace complexities, e.g., O(1) term-completion with preprocessing complexity\nthat is linear in the number of known terms of the matrix/tensor. From a\npractical perspective, the seamless ability of the framework to generalize to\nexploit high-dimensional structural relationships among key state variables,\ne.g., user and product attributes, offers a means for extracting significantly\nmore information than is possible for alternative methods that cannot\ngeneralize beyond direct user-product relationships.\n","authors":["Tung Nguyen","Jeffrey Uhlmann"],"pdf_url":"https://arxiv.org/pdf/2204.01815v6.pdf","comment":"Final published version"},{"id":"http://arxiv.org/abs/2310.11270v1","updated":"2023-10-17T13:42:32Z","published":"2023-10-17T13:42:32Z","title":"Graph Neural Networks for Recommendation: Reproducibility, Graph\n Topology, and Node Representation","summary":" Graph neural networks (GNNs) have gained prominence in recommendation systems\nin recent years. By representing the user-item matrix as a bipartite and\nundirected graph, GNNs have demonstrated their potential to capture short- and\nlong-distance user-item interactions, thereby learning more accurate preference\npatterns than traditional recommendation approaches. 
In contrast to previous\ntutorials on the same topic, this tutorial aims to present and examine three\nkey aspects that characterize GNNs for recommendation: (i) the reproducibility\nof state-of-the-art approaches, (ii) the potential impact of graph topological\ncharacteristics on the performance of these models, and (iii) strategies for\nlearning node representations when training features from scratch or utilizing\npre-trained embeddings as additional item information (e.g., multimodal\nfeatures). The goal is to provide three novel theoretical and practical\nperspectives on the field, currently subject to debate in graph learning but\nlong been overlooked in the context of recommendation systems.\n","authors":["Daniele Malitesta","Claudio Pomo","Tommaso Di Noia"],"pdf_url":"https://arxiv.org/pdf/2310.11270v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04633v3","updated":"2023-10-17T13:36:33Z","published":"2023-10-07T01:00:40Z","title":"Unbiased and Robust: External Attention-enhanced Graph Contrastive\n Learning for Cross-domain Sequential Recommendation","summary":" Cross-domain sequential recommenders (CSRs) are gaining considerable research\nattention as they can capture user sequential preference by leveraging side\ninformation from multiple domains. However, these works typically follow an\nideal setup, i.e., different domains obey similar data distribution, which\nignores the bias brought by asymmetric interaction densities (a.k.a. the\ninter-domain density bias). Besides, the frequently adopted mechanism (e.g.,\nthe self-attention network) in sequence encoder only focuses on the\ninteractions within a local view, which overlooks the global correlations\nbetween different training batches. To this end, we propose an External\nAttention-enhanced Graph Contrastive Learning framework, namely EA-GCL.\nSpecifically, to remove the impact of the inter-domain density bias, an\nauxiliary Self-Supervised Learning (SSL) task is attached to the traditional\ngraph encoder under a multi-task learning manner. To robustly capture users'\nbehavioral patterns, we develop an external attention-based sequence encoder\nthat contains an MLP-based memory-sharing structure. Unlike the self-attention\nmechanism, such a structure can effectively alleviate the bias interference\nfrom the batch-based training scheme. Extensive experiments on two real-world\ndatasets demonstrate that EA-GCL outperforms several state-of-the-art baselines\non CSR tasks. The source codes and relevant datasets are available at\nhttps://github.com/HoupingY/EA-GCL.\n","authors":["Xinhua Wang","Houping Yue","Zizheng Wang","Liancheng Xu","Jinyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.04633v3.pdf","comment":"9 pages, 4 figures, accepted by ICDM 2023 (workshop-GML4Rec)"},{"id":"http://arxiv.org/abs/2305.07609v3","updated":"2023-10-17T13:29:54Z","published":"2023-05-12T16:54:36Z","title":"Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large\n Language Model Recommendation","summary":" The remarkable achievements of Large Language Models (LLMs) have led to the\nemergence of a novel recommendation paradigm -- Recommendation via LLM\n(RecLLM). Nevertheless, it is important to note that LLMs may contain social\nprejudices, and therefore, the fairness of recommendations made by RecLLM\nrequires further investigation. To avoid the potential risks of RecLLM, it is\nimperative to evaluate the fairness of RecLLM with respect to various sensitive\nattributes on the user side. 
Due to the differences between the RecLLM paradigm\nand the traditional recommendation paradigm, it is problematic to directly use\nthe fairness benchmark of traditional recommendation. To address the dilemma,\nwe propose a novel benchmark called Fairness of Recommendation via LLM\n(FaiRLLM). This benchmark comprises carefully crafted metrics and a dataset\nthat accounts for eight sensitive attributes1 in two recommendation scenarios:\nmusic and movies. By utilizing our FaiRLLM benchmark, we conducted an\nevaluation of ChatGPT and discovered that it still exhibits unfairness to some\nsensitive attributes when generating recommendations. Our code and dataset can\nbe found at https://github.com/jizhi-zhang/FaiRLLM.\n","authors":["Jizhi Zhang","Keqin Bao","Yang Zhang","Wenjie Wang","Fuli Feng","Xiangnan He"],"pdf_url":"https://arxiv.org/pdf/2305.07609v3.pdf","comment":"Accepted by Recsys 2023 (Short)"},{"id":"http://arxiv.org/abs/2305.00447v3","updated":"2023-10-17T13:29:42Z","published":"2023-04-30T10:55:56Z","title":"TALLRec: An Effective and Efficient Tuning Framework to Align Large\n Language Model with Recommendation","summary":" Large Language Models (LLMs) have demonstrated remarkable performance across\ndiverse domains, thereby prompting researchers to explore their potential for\nuse in recommendation systems. Initial attempts have leveraged the exceptional\ncapabilities of LLMs, such as rich knowledge and strong generalization through\nIn-context Learning, which involves phrasing the recommendation task as\nprompts. Nevertheless, the performance of LLMs in recommendation tasks remains\nsuboptimal due to a substantial disparity between the training tasks for LLMs\nand recommendation tasks, as well as inadequate recommendation data during\npre-training. To bridge the gap, we consider building a Large Recommendation\nLanguage Model by tunning LLMs with recommendation data. To this end, we\npropose an efficient and effective Tuning framework for Aligning LLMs with\nRecommendation, namely TALLRec. We have demonstrated that the proposed TALLRec\nframework can significantly enhance the recommendation capabilities of LLMs in\nthe movie and book domains, even with a limited dataset of fewer than 100\nsamples. Additionally, the proposed framework is highly efficient and can be\nexecuted on a single RTX 3090 with LLaMA-7B. Furthermore, the fine-tuned LLM\nexhibits robust cross-domain generalization. Our code and data are available at\nhttps://github.com/SAI990323/TALLRec.\n","authors":["Keqin Bao","Jizhi Zhang","Yang Zhang","Wenjie Wang","Fuli Feng","Xiangnan He"],"pdf_url":"https://arxiv.org/pdf/2305.00447v3.pdf","comment":"RecSys '23: Proceedings of the 17th ACM Conference on Recommender\n Systems; September 2023 Pages; 1007-1014"},{"id":"http://arxiv.org/abs/2308.09904v2","updated":"2023-10-17T11:48:10Z","published":"2023-08-19T04:46:01Z","title":"RAH! RecSys-Assistant-Human: A Human-Centered Recommendation Framework\n with LLM Agents","summary":" The rapid evolution of the web has led to an exponential growth in content.\nRecommender systems play a crucial role in Human-Computer Interaction (HCI) by\ntailoring content based on individual preferences. Despite their importance,\nchallenges persist in balancing recommendation accuracy with user satisfaction,\naddressing biases while preserving user privacy, and solving cold-start\nproblems in cross-domain situations. 
This research argues that addressing these\nissues is not solely the recommender systems' responsibility, and a\nhuman-centered approach is vital. We introduce the RAH Recommender system,\nAssistant, and Human) framework, an innovative solution with LLM-based agents\nsuch as Perceive, Learn, Act, Critic, and Reflect, emphasizing the alignment\nwith user personalities. The framework utilizes the Learn-Act-Critic loop and a\nreflection mechanism for improving user alignment. Using the real-world data,\nour experiments demonstrate the RAH framework's efficacy in various\nrecommendation domains, from reducing human burden to mitigating biases and\nenhancing user control. Notably, our contributions provide a human-centered\nrecommendation framework that partners effectively with various recommendation\nmodels.\n","authors":["Yubo Shu","Haonan Zhang","Hansu Gu","Peng Zhang","Tun Lu","Dongsheng Li","Ning Gu"],"pdf_url":"https://arxiv.org/pdf/2308.09904v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11088v1","updated":"2023-10-17T09:13:24Z","published":"2023-10-17T09:13:24Z","title":"MeKB-Rec: Personal Knowledge Graph Learning for Cross-Domain\n Recommendation","summary":" It is a long-standing challenge in modern recommender systems to effectively\nmake recommendations for new users, namely the cold-start problem. Cross-Domain\nRecommendation (CDR) has been proposed to address this challenge, but current\nways to represent users' interests across systems are still severely limited.\nWe introduce Personal Knowledge Graph (PKG) as a domain-invariant interest\nrepresentation, and propose a novel CDR paradigm named MeKB-Rec. We first link\nusers and entities in a knowledge base to construct a PKG of users' interests,\nnamed MeKB. Then we learn a semantic representation of MeKB for the\ncross-domain recommendation. To efficiently utilize limited training data in\nCDR, MeKB-Rec employs Pretrained Language Models to inject world knowledge into\nunderstanding users' interests. Beyond most existing systems, our approach\nbuilds a semantic mapping across domains which breaks the requirement for\nin-domain user behaviors, enabling zero-shot recommendations for new users in a\nlow-resource domain. We experiment MeKB-Rec on well-established public CDR\ndatasets, and demonstrate that the new formulation % is more powerful than\nprevious approaches, achieves a new state-of-the-art that significantly\nimproves HR@10 and NDCG@10 metrics over best previous approaches by 24\\%--91\\%,\nwith a 105\\% improvement for HR@10 of zero-shot users with no behavior in the\ntarget domain. We deploy MeKB-Rec in WeiXin recommendation scenarios and\nachieve significant gains in core online metrics. MeKB-Rec is now serving\nhundreds of millions of users in real-world products.\n","authors":["Xin Su","Yao Zhou","Zifei Shan","Qian Chen"],"pdf_url":"https://arxiv.org/pdf/2310.11088v1.pdf","comment":"13 pages, 4 figures, conference"},{"id":"http://arxiv.org/abs/2306.15010v3","updated":"2023-10-17T08:57:30Z","published":"2023-06-26T18:49:09Z","title":"Efficient High-Resolution Template Matching with Vector Quantized\n Nearest Neighbour Fields","summary":" Template matching is a fundamental problem in computer vision with\napplications in fields including object detection, image registration, and\nobject tracking. Current methods rely on nearest-neighbour (NN) matching, where\nthe query feature space is converted to NN space by representing each query\npixel with its NN in the template. 
NN-based methods have been shown to perform\nbetter in occlusions, appearance changes, and non-rigid transformations;\nhowever, they scale poorly with high-resolution data and high feature\ndimensions. We present an NN-based method which efficiently reduces the NN\ncomputations and introduces filtering in the NN fields (NNFs). A vector\nquantization step is introduced before the NN calculation to represent the\ntemplate with $k$ features, and the filter response over the NNFs is used to\ncompare the template and query distributions over the features. We show that\nstate-of-the-art performance is achieved in low-resolution data, and our method\noutperforms previous methods at higher resolution.\n","authors":["Ankit Gupta","Ida-Maria Sintorn"],"pdf_url":"https://arxiv.org/pdf/2306.15010v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11049v1","updated":"2023-10-17T07:35:11Z","published":"2023-10-17T07:35:11Z","title":"Nonet at SemEval-2023 Task 6: Methodologies for Legal Evaluation","summary":" This paper describes our submission to the SemEval-2023 for Task 6 on\nLegalEval: Understanding Legal Texts. Our submission concentrated on three\nsubtasks: Legal Named Entity Recognition (L-NER) for Task-B, Legal Judgment\nPrediction (LJP) for Task-C1, and Court Judgment Prediction with Explanation\n(CJPE) for Task-C2. We conducted various experiments on these subtasks and\npresented the results in detail, including data statistics and methodology. It\nis worth noting that legal tasks, such as those tackled in this research, have\nbeen gaining importance due to the increasing need to automate legal analysis\nand support. Our team obtained competitive rankings of 15$^{th}$, 11$^{th}$,\nand 1$^{st}$ in Task-B, Task-C1, and Task-C2, respectively, as reported on the\nleaderboard.\n","authors":["Shubham Kumar Nigam","Aniket Deroy","Noel Shallum","Ayush Kumar Mishra","Anup Roy","Shubham Kumar Mishra","Arnab Bhattacharya","Saptarshi Ghosh","Kripabandhu Ghosh"],"pdf_url":"https://arxiv.org/pdf/2310.11049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2105.01331v3","updated":"2023-10-17T07:30:40Z","published":"2021-05-04T07:27:42Z","title":"BLM-17m: A Large-Scale Dataset for Black Lives Matter Topic Detection on\n Twitter","summary":" Protection of human rights is one of the most important problems of our\nworld. In this paper, our aim is to provide a dataset which covers one of the\nmost significant human rights contradiction in recent months affected the whole\nworld, George Floyd incident. We propose a labeled dataset for topic detection\nthat contains 17 million tweets. These Tweets are collected from 25 May 2020 to\n21 August 2020 that covers 89 days from start of this incident. We labeled the\ndataset by monitoring most trending news topics from global and local\nnewspapers. Apart from that, we present two baselines, TF-IDF and LDA. We\nevaluated the results of these two methods with three different k values for\nmetrics of precision, recall and f1-score. The collected dataset is available\nat https://github.com/MeysamAsgariC/BLMT.\n","authors":["Hasan Kemik","Nusret Özateş","Meysam Asgari-Chenaghlu","Yang Li","Erik Cambria"],"pdf_url":"https://arxiv.org/pdf/2105.01331v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.05521v3","updated":"2023-10-17T07:25:38Z","published":"2022-10-11T15:12:41Z","title":"Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval","summary":" Inverted file structure is a common technique for accelerating dense\nretrieval. 
It clusters documents based on their embeddings; during searching,\nit probes nearby clusters w.r.t. an input query and only evaluates documents\nwithin them by subsequent codecs, thus avoiding the expensive cost of\nexhaustive traversal. However, the clustering is always lossy, which results in\nthe miss of relevant documents in the probed clusters and hence degrades\nretrieval quality. In contrast, lexical matching, such as overlaps of salient\nterms, tends to be strong feature for identifying relevant documents. In this\nwork, we present the Hybrid Inverted Index (HI$^2$), where the embedding\nclusters and salient terms work collaboratively to accelerate dense retrieval.\nTo make best of both effectiveness and efficiency, we devise a cluster selector\nand a term selector, to construct compact inverted lists and efficiently\nsearching through them. Moreover, we leverage simple unsupervised algorithms as\nwell as end-to-end knowledge distillation to learn these two modules, with the\nlatter further boosting the effectiveness. Based on comprehensive experiments\non popular retrieval benchmarks, we verify that clusters and terms indeed\ncomplement each other, enabling HI$^2$ to achieve lossless retrieval quality\nwith competitive efficiency across various index settings. Our code and\ncheckpoint are publicly available at\nhttps://github.com/namespace-Pt/Adon/tree/HI2.\n","authors":["Peitian Zhang","Zheng Liu","Shitao Xiao","Zhicheng Dou","Jing Yao"],"pdf_url":"https://arxiv.org/pdf/2210.05521v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09234v2","updated":"2023-10-17T04:53:08Z","published":"2023-10-13T16:37:53Z","title":"ClickPrompt: CTR Models are Strong Prompt Generators for Adapting\n Language Models to CTR Prediction","summary":" Click-through rate (CTR) prediction has become increasingly indispensable for\nvarious Internet applications. Traditional CTR models convert the multi-field\ncategorical data into ID features via one-hot encoding, and extract the\ncollaborative signals among features. Such a paradigm suffers from the problem\nof semantic information loss. Another line of research explores the potential\nof pretrained language models (PLMs) for CTR prediction by converting input\ndata into textual sentences through hard prompt templates. Although semantic\nsignals are preserved, they generally fail to capture the collaborative\ninformation (e.g., feature interactions, pure ID features), not to mention the\nunacceptable inference overhead brought by the huge model size. In this paper,\nwe aim to model both the semantic knowledge and collaborative knowledge for\naccurate CTR estimation, and meanwhile address the inference inefficiency\nissue. To benefit from both worlds and close their gaps, we propose a novel\nmodel-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models\nto generate interaction-aware soft prompts for PLMs. We design a\nprompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM\nhas to recover the masked tokens based on the language context, as well as the\nsoft prompts generated by CTR model. The collaborative and semantic knowledge\nfrom ID and textual features would be explicitly aligned and interacted via the\nprompt interface. 
Then, we can either tune the CTR model with PLM for superior\nperformance, or solely tune the CTR model without PLM for inference efficiency.\nExperiments on four real-world datasets validate the effectiveness of\nClickPrompt compared with existing baselines.\n","authors":["Jianghao Lin","Bo Chen","Hangyu Wang","Yunjia Xi","Yanru Qu","Xinyi Dai","Kangning Zhang","Ruiming Tang","Yong Yu","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.09234v2.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2212.10764v2","updated":"2023-10-17T22:15:38Z","published":"2022-12-21T04:49:55Z","title":"Learning List-Level Domain-Invariant Representations for Ranking","summary":" Domain adaptation aims to transfer the knowledge learned on (data-rich)\nsource domains to (low-resource) target domains, and a popular method is\ninvariant representation learning, which matches and aligns the data\ndistributions on the feature space. Although this method is studied extensively\nand applied on classification and regression problems, its adoption on ranking\nproblems is sporadic, and the few existing implementations lack theoretical\njustifications. This paper revisits invariant representation learning for\nranking. Upon reviewing prior work, we found that they implement what we call\nitem-level alignment, which aligns the distributions of the items being ranked\nfrom all lists in aggregate but ignores their list structure. However, the list\nstructure should be leveraged, because it is intrinsic to ranking problems\nwhere the data and the metrics are defined and computed on lists, not the items\nby themselves. To close this discrepancy, we propose list-level alignment --\nlearning domain-invariant representations at the higher level of lists. The\nbenefits are twofold: it leads to the first domain adaptation generalization\nbound for ranking, in turn providing theoretical support for the proposed\nmethod, and it achieves better empirical transfer performance for unsupervised\ndomain adaptation on ranking tasks, including passage reranking.\n","authors":["Ruicheng Xian","Honglei Zhuang","Zhen Qin","Hamed Zamani","Jing Lu","Ji Ma","Kai Hui","Han Zhao","Xuanhui Wang","Michael Bendersky"],"pdf_url":"https://arxiv.org/pdf/2212.10764v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.09473v2","updated":"2023-10-17T22:01:00Z","published":"2023-02-19T04:03:22Z","title":"Video-Text Retrieval by Supervised Sparse Multi-Grained Learning","summary":" While recent progress in video-text retrieval has been advanced by the\nexploration of better representation learning, in this paper, we present a\nnovel multi-grained sparse learning framework, S3MA, to learn an aligned sparse\nspace shared between the video and the text for video-text retrieval. The\nshared sparse space is initialized with a finite number of sparse concepts,\neach of which refers to a number of words. With the text data at hand, we learn\nand update the shared sparse space in a supervised manner using the proposed\nsimilarity and alignment losses. Moreover, to enable multi-grained alignment,\nwe incorporate frame representations for better modeling the video modality and\ncalculating fine-grained and coarse-grained similarities. Benefiting from the\nlearned shared sparse space and multi-grained similarities, extensive\nexperiments on several video-text retrieval benchmarks demonstrate the\nsuperiority of S3MA over existing methods. 
Our code is available at\nhttps://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.\n","authors":["Yimu Wang","Peng Shi"],"pdf_url":"https://arxiv.org/pdf/2302.09473v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2301.03560v2","updated":"2023-10-17T21:42:38Z","published":"2023-01-09T18:20:55Z","title":"Solo: Data Discovery Using Natural Language Questions Via A\n Self-Supervised Approach","summary":" Most deployed data discovery systems, such as Google Datasets, and open data\nportals only support keyword search. Keyword search is geared towards general\naudiences but limits the types of queries the systems can answer. We propose a\nnew system that lets users write natural language questions directly. A major\nbarrier to using this learned data discovery system is it needs\nexpensive-to-collect training data, thus limiting its utility. In this paper,\nwe introduce a self-supervised approach to assemble training datasets and train\nlearned discovery systems without human intervention. It requires addressing\nseveral challenges, including the design of self-supervised strategies for data\ndiscovery, table representation strategies to feed to the models, and relevance\nmodels that work well with the synthetically generated questions. We combine\nall the above contributions into a system, Solo, that solves the problem end to\nend. The evaluation results demonstrate the new techniques outperform\nstate-of-the-art approaches on well-known benchmarks. All in all, the technique\nis a stepping stone towards building learned discovery systems. The code is\nopen-sourced at https://github.com/TheDataStation/solo\n","authors":["Qiming Wang","Raul Castro Fernandez"],"pdf_url":"https://arxiv.org/pdf/2301.03560v2.pdf","comment":"To appear at Sigmod 2024"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2310.11451v1","updated":"2023-10-17T17:58:34Z","published":"2023-10-17T17:58:34Z","title":"Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from\n a Parametric Perspective","summary":" Large Language Models (LLMs) inherently encode a wealth of knowledge within\ntheir parameters through pre-training on extensive corpora. While prior\nresearch has delved into operations on these parameters to manipulate the\nunderlying implicit knowledge (encompassing detection, editing, and merging),\nthere remains an ambiguous understanding regarding their transferability across\nmodels with varying scales. In this paper, we seek to empirically investigate\nknowledge transfer from larger to smaller models through a parametric\nperspective. To achieve this, we employ sensitivity-based techniques to extract\nand align knowledge-specific parameters between different LLMs. Moreover, the\nLoRA module is used as the intermediary mechanism for injecting the extracted\nknowledge into smaller models. Evaluations across four benchmarks validate the\nefficacy of our proposed method. Our findings highlight the critical factors\ncontributing to the process of parametric knowledge transfer, underscoring the\ntransferability of model parameters across LLMs of different scales. 
We release\ncode and data at \\url{https://github.com/maszhongming/ParaKnowTransfer}.\n","authors":["Ming Zhong","Chenxin An","Weizhu Chen","Jiawei Han","Pengcheng He"],"pdf_url":"https://arxiv.org/pdf/2310.11451v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2310.11450v1","updated":"2023-10-17T17:58:19Z","published":"2023-10-17T17:58:19Z","title":"Explaining Deep Neural Networks for Bearing Fault Detection with\n Vibration Concepts","summary":" Concept-based explanation methods, such as Concept Activation Vectors, are\npotent means to quantify how abstract or high-level characteristics of input\ndata influence the predictions of complex deep neural networks. However,\napplying them to industrial prediction problems is challenging as it is not\nimmediately clear how to define and access appropriate concepts for individual\nuse cases and specific data types. In this work, we investigate how to leverage\nestablished concept-based explanation techniques in the context of bearing\nfault detection with deep neural networks trained on vibration signals. Since\nbearings are prevalent in almost every rotating equipment, ensuring the\nreliability of intransparent fault detection models is crucial to prevent\ncostly repairs and downtimes of industrial machinery. Our evaluations\ndemonstrate that explaining opaque models in terms of vibration concepts\nenables human-comprehensible and intuitive insights about their inner workings,\nbut the underlying assumptions need to be carefully validated first.\n","authors":["Thomas Decker","Michael Lebacher","Volker Tresp"],"pdf_url":"https://arxiv.org/pdf/2310.11450v1.pdf","comment":"2023 IEEE 21st International Conference on Industrial Informatics\n (INDIN)"},{"id":"http://arxiv.org/abs/2310.11449v1","updated":"2023-10-17T17:58:00Z","published":"2023-10-17T17:58:00Z","title":"DELIFFAS: Deformable Light Fields for Fast Avatar Synthesis","summary":" Generating controllable and photorealistic digital human avatars is a\nlong-standing and important problem in Vision and Graphics. Recent methods have\nshown great progress in terms of either photorealism or inference speed while\nthe combination of the two desired properties still remains unsolved. To this\nend, we propose a novel method, called DELIFFAS, which parameterizes the\nappearance of the human as a surface light field that is attached to a\ncontrollable and deforming human mesh model. At the core, we represent the\nlight field around the human with a deformable two-surface parameterization,\nwhich enables fast and accurate inference of the human appearance. This allows\nperceptual supervision on the full image compared to previous approaches that\ncould only supervise individual pixels or small patches due to their slow\nruntime. Our carefully designed human representation and supervision strategy\nleads to state-of-the-art synthesis results and inference time. 
The video\nresults and code are available at\nhttps://vcai.mpi-inf.mpg.de/projects/DELIFFAS.\n","authors":["Youngjoong Kwon","Lingjie Liu","Henry Fuchs","Marc Habermann","Christian Theobalt"],"pdf_url":"https://arxiv.org/pdf/2310.11449v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10649v2","updated":"2023-10-17T17:55:33Z","published":"2023-10-16T17:59:54Z","title":"A Computational Framework for Solving Wasserstein Lagrangian Flows","summary":" The dynamical formulation of the optimal transport can be extended through\nvarious choices of the underlying geometry ($\\textit{kinetic energy}$), and the\nregularization of density paths ($\\textit{potential energy}$). These\ncombinations yield different variational problems ($\\textit{Lagrangians}$),\nencompassing many variations of the optimal transport problem such as the\nSchr\\\"odinger bridge, unbalanced optimal transport, and optimal transport with\nphysical constraints, among others. In general, the optimal density path is\nunknown, and solving these variational problems can be computationally\nchallenging. Leveraging the dual formulation of the Lagrangians, we propose a\nnovel deep learning based framework approaching all of these problems from a\nunified perspective. Our method does not require simulating or backpropagating\nthrough the trajectories of the learned dynamics, and does not need access to\noptimal couplings. We showcase the versatility of the proposed framework by\noutperforming previous approaches for the single-cell trajectory inference,\nwhere incorporating prior knowledge into the dynamics is crucial for correct\npredictions.\n","authors":["Kirill Neklyudov","Rob Brekelmans","Alexander Tong","Lazar Atanackovic","Qiang Liu","Alireza Makhzani"],"pdf_url":"https://arxiv.org/pdf/2310.10649v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11445v1","updated":"2023-10-17T17:55:32Z","published":"2023-10-17T17:55:32Z","title":"Stochastic Quantum Sampling for Non-Logconcave Distributions and\n Estimating Partition Functions","summary":" We present quantum algorithms for sampling from non-logconcave probability\ndistributions in the form of $\\pi(x) \\propto \\exp(-\\beta f(x))$. Here, $f$ can\nbe written as a finite sum $f(x):= \\frac{1}{N}\\sum_{k=1}^N f_k(x)$. Our\napproach is based on quantum simulated annealing on slowly varying Markov\nchains derived from unadjusted Langevin algorithms, removing the necessity for\nfunction evaluations which can be computationally expensive for large data sets\nin mixture modeling and multi-stable systems. We also incorporate a stochastic\ngradient oracle that implements the quantum walk operators inexactly by only\nusing mini-batch gradients. As a result, our stochastic gradient based\nalgorithm only accesses small subsets of data points in implementing the\nquantum walk. One challenge of quantizing the resulting Markov chains is that\nthey do not satisfy the detailed balance condition in general. Consequently,\nthe mixing time of the algorithm cannot be expressed in terms of the spectral\ngap of the transition density, making the quantum algorithms nontrivial to\nanalyze. To overcome these challenges, we first build a hypothetical Markov\nchain that is reversible, and also converges to the target distribution. Then,\nwe quantified the distance between our algorithm's output and the target\ndistribution by using this hypothetical chain as a bridge to establish the\ntotal complexity. 
Our quantum algorithms exhibit polynomial speedups in terms\nof both dimension and precision dependencies when compared to the best-known\nclassical algorithms.\n","authors":["Guneykan Ozgul","Xiantao Li","Mehrdad Mahdavi","Chunhao Wang"],"pdf_url":"https://arxiv.org/pdf/2310.11445v1.pdf","comment":"32 pages"},{"id":"http://arxiv.org/abs/2310.09986v2","updated":"2023-10-17T17:50:36Z","published":"2023-10-15T23:59:57Z","title":"On Statistical Learning of Branch and Bound for Vehicle Routing\n Optimization","summary":" Recently, machine learning of the branch and bound algorithm has shown\npromise in approximating competent solutions to NP-hard problems. In this\npaper, we utilize and comprehensively compare the outcomes of three neural\nnetworks--graph convolutional neural network (GCNN), GraphSAGE, and graph\nattention network (GAT)--to solve the capacitated vehicle routing problem. We\ntrain these neural networks to emulate the decision-making process of the\ncomputationally expensive Strong Branching strategy. The neural networks are\ntrained on six instances with distinct topologies from the CVRPLIB and\nevaluated on eight additional instances. Moreover, we reduced the minimum\nnumber of vehicles required to solve a CVRP instance to a bin-packing problem,\nwhich was addressed in a similar manner. Through rigorous experimentation, we\nfound that this approach can match or improve upon the performance of the\nbranch and bound algorithm with the Strong Branching strategy while requiring\nsignificantly less computational time. The source code that corresponds to our\nresearch findings and methodology is readily accessible and available for\nreference at the following web address: https://isotlaboratory.github.io/ml4vrp\n","authors":["Andrew Naguib","Waleed A. Yousef","Issa Traoré","Mohammad Mamun"],"pdf_url":"https://arxiv.org/pdf/2310.09986v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11439v1","updated":"2023-10-17T17:50:22Z","published":"2023-10-17T17:50:22Z","title":"Understanding deep neural networks through the lens of their\n non-linearity","summary":" The remarkable success of deep neural networks (DNN) is often attributed to\ntheir high expressive power and their ability to approximate functions of\narbitrary complexity. Indeed, DNNs are highly non-linear models, and activation\nfunctions introduced into them are largely responsible for this. While many\nworks studied the expressive power of DNNs through the lens of their\napproximation capabilities, quantifying the non-linearity of DNNs or of\nindividual activation functions remains an open problem. In this paper, we\npropose the first theoretically sound solution to track non-linearity\npropagation in deep neural networks with a specific focus on computer vision\napplications. Our proposed affinity score allows us to gain insights into the\ninner workings of a wide range of different architectures and learning\nparadigms. 
We provide extensive experimental results that highlight the\npractical utility of the proposed affinity score and its potential for\nlong-reaching applications.\n","authors":["Quentin Bouniot","Ievgen Redko","Anton Mallasto","Charlotte Laclau","Karol Arndt","Oliver Struckmeier","Markus Heinonen","Ville Kyrki","Samuel Kaski"],"pdf_url":"https://arxiv.org/pdf/2310.11439v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09447v2","updated":"2023-10-17T17:46:01Z","published":"2023-03-16T16:23:13Z","title":"Steering Prototypes with Prompt-tuning for Rehearsal-free Continual\n Learning","summary":" In the context of continual learning, prototypes-as representative class\nembeddings-offer advantages in memory conservation and the mitigation of\ncatastrophic forgetting. However, challenges related to semantic drift and\nprototype interference persist. In this study, we introduce the Contrastive\nPrototypical Prompt (CPP) approach. Through task-specific prompt-tuning,\nunderpinned by a contrastive learning objective, we effectively address both\naforementioned challenges. Our evaluations on four challenging\nclass-incremental benchmarks reveal that CPP achieves a significant 4% to 6%\nimprovement over state-of-the-art methods. Importantly, CPP operates without a\nrehearsal buffer and narrows the performance divergence between continual and\noffline joint-learning, suggesting an innovative scheme for Transformer-based\ncontinual learning systems.\n","authors":["Zhuowei Li","Long Zhao","Zizhao Zhang","Han Zhang","Di Liu","Ting Liu","Dimitris N. Metaxas"],"pdf_url":"https://arxiv.org/pdf/2303.09447v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11431v1","updated":"2023-10-17T17:41:28Z","published":"2023-10-17T17:41:28Z","title":"Identifying Interpretable Visual Features in Artificial and Biological\n Neural Systems","summary":" Single neurons in neural networks are often ``interpretable'' in that they\nrepresent individual, intuitively meaningful features. However, many neurons\nexhibit $\\textit{mixed selectivity}$, i.e., they represent multiple unrelated\nfeatures. A recent hypothesis proposes that features in deep networks may be\nrepresented in $\\textit{superposition}$, i.e., on non-orthogonal axes by\nmultiple neurons, since the number of possible interpretable features in\nnatural data is generally larger than the number of neurons in a given network.\nAccordingly, we should be able to find meaningful directions in activation\nspace that are not aligned with individual neurons. Here, we propose (1) an\nautomated method for quantifying visual interpretability that is validated\nagainst a large database of human psychophysics judgments of neuron\ninterpretability, and (2) an approach for finding meaningful directions in\nnetwork activation space. We leverage these methods to discover directions in\nconvolutional neural networks that are more intuitively meaningful than\nindividual neurons, as we confirm and investigate in a series of analyses.\nMoreover, we apply the same method to two recent datasets of visual neural\nresponses in the brain and find that our conclusions largely transfer to real\nneural data, suggesting that superposition might be deployed by the brain. 
This\nalso provides a link with disentanglement and raises fundamental questions\nabout robust, efficient and factorized representations in both artificial and\nbiological neural systems.\n","authors":["David Klindt","Sophia Sanborn","Francisco Acosta","Frédéric Poitevin","Nina Miolane"],"pdf_url":"https://arxiv.org/pdf/2310.11431v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11428v1","updated":"2023-10-17T17:39:40Z","published":"2023-10-17T17:39:40Z","title":"Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning\n and Autoregression","summary":" This work studies training instabilities of behavior cloning with deep neural\nnetworks. We observe that minibatch SGD updates to the policy network during\ntraining result in sharp oscillations in long-horizon rewards, despite\nnegligibly affecting the behavior cloning loss. We empirically disentangle the\nstatistical and computational causes of these oscillations, and find them to\nstem from the chaotic propagation of minibatch SGD noise through unstable\nclosed-loop dynamics. While SGD noise is benign in the single-step action\nprediction objective, it results in catastrophic error accumulation over long\nhorizons, an effect we term gradient variance amplification (GVA). We show that\nmany standard mitigation techniques do not alleviate GVA, but find an\nexponential moving average (EMA) of iterates to be surprisingly effective at\ndoing so. We illustrate the generality of this phenomenon by showing the\nexistence of GVA and its amelioration by EMA in both continuous control and\nautoregressive language generation. Finally, we provide theoretical vignettes\nthat highlight the benefits of EMA in alleviating GVA and shed light on the\nextent to which classical convex models can help in understanding the benefits\nof iterate averaging in deep learning.\n","authors":["Adam Block","Dylan J. Foster","Akshay Krishnamurthy","Max Simchowitz","Cyril Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.11428v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11407v1","updated":"2023-10-17T17:14:07Z","published":"2023-10-17T17:14:07Z","title":"Group-blind optimal transport to group parity and its constrained\n variants","summary":" Fairness holds a pivotal role in the realm of machine learning, particularly\nwhen it comes to addressing groups categorised by sensitive attributes, e.g.,\ngender, race. Prevailing algorithms in fair learning predominantly hinge on\naccessibility or estimations of these sensitive attributes, at least in the\ntraining process. We design a single group-blind projection map that aligns the\nfeature distributions of both groups in the source data, achieving\n(demographic) group parity, without requiring values of the protected attribute\nfor individual samples in either the computation of the map or its use.\nInstead, our approach utilises the feature distributions of the privileged and\nunprivileged groups in a broader population and the essential assumption that\nthe source data are an unbiased representation of the population. 
We present\nnumerical results on synthetic data and real data.\n","authors":["Quan Zhou","Jakub Marecek"],"pdf_url":"https://arxiv.org/pdf/2310.11407v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11401v1","updated":"2023-10-17T17:10:56Z","published":"2023-10-17T17:10:56Z","title":"Enhancing Group Fairness in Online Settings Using Oblique Decision\n Forests","summary":" Fairness, especially group fairness, is an important consideration in the\ncontext of machine learning systems. The most commonly adopted group\nfairness-enhancing techniques are in-processing methods that rely on a mixture\nof a fairness objective (e.g., demographic parity) and a task-specific\nobjective (e.g., cross-entropy) during the training process. However, when data\narrives in an online fashion -- one instance at a time -- optimizing such\nfairness objectives poses several challenges. In particular, group fairness\nobjectives are defined using expectations of predictions across different\ndemographic groups. In the online setting, where the algorithm has access to a\nsingle instance at a time, estimating the group fairness objective requires\nadditional storage and significantly more computation (e.g., forward/backward\npasses) than the task-specific objective at every time step. In this paper, we\npropose Aranyani, an ensemble of oblique decision trees, to make fair decisions\nin online settings. The hierarchical tree structure of Aranyani enables\nparameter isolation and allows us to efficiently compute the fairness gradients\nusing aggregate statistics of previous decisions, eliminating the need for\nadditional storage and forward/backward passes. We also present an efficient\nframework to train Aranyani and theoretically analyze several of its\nproperties. We conduct empirical evaluations on 5 publicly available benchmarks\n(including vision and language datasets) to show that Aranyani achieves a\nbetter accuracy-fairness trade-off compared to baseline approaches.\n","authors":["Somnath Basu Roy Chowdhury","Nicholas Monath","Ahmad Beirami","Rahul Kidambi","Avinava Dubey","Amr Ahmed","Snigdha Chaturvedi"],"pdf_url":"https://arxiv.org/pdf/2310.11401v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2204.01815v6","updated":"2023-10-17T17:06:34Z","published":"2022-04-04T19:42:46Z","title":"Tensor Completion with Provable Consistency and Fairness Guarantees for\n Recommender Systems","summary":" We introduce a new consistency-based approach for defining and solving\nnonnegative/positive matrix and tensor completion problems. The novelty of the\nframework is that instead of artificially making the problem well-posed in the\nform of an application-arbitrary optimization problem, e.g., minimizing a bulk\nstructural measure such as rank or norm, we show that a single\nproperty/constraint: preserving unit-scale consistency, guarantees the\nexistence of both a solution and, under relatively weak support assumptions,\nuniqueness. The framework and solution algorithms also generalize directly to\ntensors of arbitrary dimensions while maintaining computational complexity that\nis linear in problem size for fixed dimension d. In the context of recommender\nsystem (RS) applications, we prove that two reasonable properties that should\nbe expected to hold for any solution to the RS problem are sufficient to permit\nuniqueness guarantees to be established within our framework. 
This is\nremarkable because it obviates the need for heuristic-based statistical or AI\nmethods despite what appear to be distinctly human/subjective variables at the\nheart of the problem. Key theoretical contributions include a general\nunit-consistent tensor-completion framework with proofs of its properties,\ne.g., consensus-order and fairness, and algorithms with optimal runtime and\nspace complexities, e.g., O(1) term-completion with preprocessing complexity\nthat is linear in the number of known terms of the matrix/tensor. From a\npractical perspective, the seamless ability of the framework to generalize to\nexploit high-dimensional structural relationships among key state variables,\ne.g., user and product attributes, offers a means for extracting significantly\nmore information than is possible for alternative methods that cannot\ngeneralize beyond direct user-product relationships.\n","authors":["Tung Nguyen","Jeffrey Uhlmann"],"pdf_url":"https://arxiv.org/pdf/2204.01815v6.pdf","comment":"Final published version"},{"id":"http://arxiv.org/abs/2303.04614v2","updated":"2023-10-17T17:06:04Z","published":"2023-03-08T14:35:03Z","title":"Densely Connected $G$-invariant Deep Neural Networks with Signed\n Permutation Representations","summary":" We introduce and investigate, for finite groups $G$, $G$-invariant deep\nneural network ($G$-DNN) architectures with ReLU activation that are densely\nconnected -- i.e., include all possible skip connections. In contrast to other\n$G$-invariant architectures in the literature, the preactivations of\nthe $G$-DNNs presented here are able to transform by \\emph{signed} permutation\nrepresentations (signed perm-reps) of $G$. Moreover, the individual layers of\nthe $G$-DNNs are not required to be $G$-equivariant; instead, the\npreactivations are constrained to be $G$-equivariant functions of the network\ninput in a way that couples weights across all layers. The result is a richer\nfamily of $G$-invariant architectures never seen previously. We derive an\nefficient implementation of $G$-DNNs after a reparameterization of weights, as\nwell as necessary and sufficient conditions for an architecture to be\n``admissible'' -- i.e., nondegenerate and inequivalent to smaller architectures.\nWe include code that allows a user to build a $G$-DNN interactively\nlayer-by-layer, with the final architecture guaranteed to be admissible. We\nshow that there are far more admissible $G$-DNN architectures than those\naccessible with the ``concatenated ReLU'' activation function from the\nliterature. Finally, we apply $G$-DNNs to two example problems -- (1)\nmultiplication in $\\{-1, 1\\}$ (with theoretical guarantees) and (2) 3D object\nclassification -- finding that the inclusion of signed perm-reps\nsignificantly boosts predictive performance compared to baselines with only\nordinary (i.e., unsigned) perm-reps.\n","authors":["Devanshu Agrawal","James Ostrowski"],"pdf_url":"https://arxiv.org/pdf/2303.04614v2.pdf","comment":"40 pages, 2 figures, 4 tables. For associated code repository see\n https://github.com/dagrawa2/gdnn_code"},{"id":"http://arxiv.org/abs/2310.11397v1","updated":"2023-10-17T17:03:00Z","published":"2023-10-17T17:03:00Z","title":"Last One Standing: A Comparative Analysis of Security and Privacy of\n Soft Prompt Tuning, LoRA, and In-Context Learning","summary":" Large Language Models (LLMs) are powerful tools for natural language\nprocessing, enabling novel applications and user experiences. 
However, to\nachieve optimal performance, LLMs often require adaptation with private data,\nwhich poses privacy and security challenges. Several techniques have been\nproposed to adapt LLMs with private data, such as Low-Rank Adaptation (LoRA),\nSoft Prompt Tuning (SPT), and In-Context Learning (ICL), but their comparative\nprivacy and security properties have not been systematically investigated. In\nthis work, we fill this gap by evaluating the robustness of LoRA, SPT, and ICL\nagainst three types of well-established attacks: membership inference, which\nexposes data leakage (privacy); backdoor, which injects malicious behavior\n(security); and model stealing, which can violate intellectual property\n(privacy and security). Our results show that there is no silver bullet for\nprivacy and security in LLM adaptation and each technique has different\nstrengths and weaknesses.\n","authors":["Rui Wen","Tianhao Wang","Michael Backes","Yang Zhang","Ahmed Salem"],"pdf_url":"https://arxiv.org/pdf/2310.11397v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.05372v2","updated":"2023-10-17T16:56:55Z","published":"2023-02-10T16:50:40Z","title":"Towards Minimax Optimality of Model-based Robust Reinforcement Learning","summary":" We study the sample complexity of obtaining an $\\epsilon$-optimal policy in\n\\emph{Robust} discounted Markov Decision Processes (RMDPs), given only access\nto a generative model of the nominal kernel. This problem is widely studied in\nthe non-robust case, and it is known that any planning approach applied to an\nempirical MDP estimated with $\\tilde{\\mathcal{O}}(\\frac{H^3 \\mid S \\mid\\mid A\n\\mid}{\\epsilon^2})$ samples provides an $\\epsilon$-optimal policy, which is\nminimax optimal. Results in the robust case are much more scarce. For $sa$-\n(resp $s$-)rectangular uncertainty sets, the best known sample complexity is\n$\\tilde{\\mathcal{O}}(\\frac{H^4 \\mid S \\mid^2\\mid A \\mid}{\\epsilon^2})$ (resp.\n$\\tilde{\\mathcal{O}}(\\frac{H^4 \\mid S \\mid^2\\mid A \\mid^2}{\\epsilon^2})$), for\nspecific algorithms and when the uncertainty set is based on the total\nvariation (TV), the KL or the Chi-square divergences. In this paper, we\nconsider uncertainty sets defined with an $L_p$-ball (recovering the TV case),\nand study the sample complexity of \\emph{any} planning algorithm (with high\naccuracy guarantee on the solution) applied to an empirical RMDP estimated\nusing the generative model. In the general case, we prove a sample complexity\nof $\\tilde{\\mathcal{O}}(\\frac{H^4 \\mid S \\mid\\mid A \\mid}{\\epsilon^2})$ for\nboth the $sa$- and $s$-rectangular cases (improvements of $\\mid S \\mid$ and\n$\\mid S \\mid\\mid A \\mid$ respectively). When the size of the uncertainty is\nsmall enough, we improve the sample complexity to\n$\\tilde{\\mathcal{O}}(\\frac{H^3 \\mid S \\mid\\mid A \\mid }{\\epsilon^2})$,\nrecovering the lower-bound for the non-robust case for the first time and a\nrobust lower-bound when the size of the uncertainty is small enough.\n","authors":["Pierre Clavier","Erwan Le Pennec","Matthieu Geist"],"pdf_url":"https://arxiv.org/pdf/2302.05372v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11389v1","updated":"2023-10-17T16:35:39Z","published":"2023-10-17T16:35:39Z","title":"VaR\\ and CVaR Estimation in a Markov Cost Process: Lower and Upper\n Bounds","summary":" We tackle the problem of estimating the Value-at-Risk (VaR) and the\nConditional Value-at-Risk (CVaR) of the infinite-horizon discounted cost within\na Markov cost process. 
First, we derive a minimax lower bound of\n$\\Omega(1/\\sqrt{n})$ that holds both in an expected and in a probabilistic\nsense. Then, using a finite-horizon truncation scheme, we derive an upper bound\nfor the error in CVaR estimation, which matches our lower bound up to constant\nfactors. Finally, we discuss an extension of our estimation scheme that covers\nmore general risk measures satisfying a certain continuity criterion, e.g.,\nspectral risk measures, utility-based shortfall risk. To the best of our\nknowledge, our work is the first to provide lower and upper bounds on the\nestimation error for any risk measure within Markovian settings. We remark that\nour lower bounds also extend to the infinite-horizon discounted costs' mean.\nEven in that case, our result $\\Omega(1/\\sqrt{n}) $ improves upon the existing\nresult $\\Omega(1/n)$[13].\n","authors":["Sanjay Bhat","Prashanth L. A.","Gugan Thoppe"],"pdf_url":"https://arxiv.org/pdf/2310.11389v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16309v3","updated":"2023-10-17T16:34:46Z","published":"2023-05-25T17:58:14Z","title":"Imitating Task and Motion Planning with Visuomotor Transformers","summary":" Imitation learning is a powerful tool for training robot manipulation\npolicies, allowing them to learn from expert demonstrations without manual\nprogramming or trial-and-error. However, common methods of data collection,\nsuch as human supervision, scale poorly, as they are time-consuming and\nlabor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously\ngenerate large-scale datasets of diverse demonstrations. In this work, we show\nthat the combination of large-scale datasets generated by TAMP supervisors and\nflexible Transformer models to fit them is a powerful paradigm for robot\nmanipulation. To that end, we present a novel imitation learning system called\nOPTIMUS that trains large-scale visuomotor Transformer policies by imitating a\nTAMP agent. OPTIMUS introduces a pipeline for generating TAMP data that is\nspecifically curated for imitation learning and can be used to train performant\ntransformer-based policies. In this paper, we present a thorough study of the\ndesign decisions required to imitate TAMP and demonstrate that OPTIMUS can\nsolve a wide variety of challenging vision-based manipulation tasks with over\n70 different objects, ranging from long-horizon pick-and-place tasks, to shelf\nand articulated object manipulation, achieving 70 to 80% success rates. Video\nresults and code at https://mihdalal.github.io/optimus/\n","authors":["Murtaza Dalal","Ajay Mandlekar","Caelan Garrett","Ankur Handa","Ruslan Salakhutdinov","Dieter Fox"],"pdf_url":"https://arxiv.org/pdf/2305.16309v3.pdf","comment":"Conference on Robot Learning (CoRL) 2023. 8 pages, 5 figures, 2\n tables; 11 pages appendix (10 additional figures)"},{"id":"http://arxiv.org/abs/2310.03708v2","updated":"2023-10-17T16:29:35Z","published":"2023-10-05T17:35:26Z","title":"Beyond One-Preference-for-All: Multi-Objective Direct Preference\n Optimization for Language Models","summary":" A single language model (LM), despite aligning well with an average labeler\nthrough reinforcement learning from human feedback (RLHF), may not universally\nsuit diverse human preferences. Recent approaches thus pursue customization,\ntraining separate principle-based reward models to represent different\nalignment objectives (e.g. helpfulness, harmlessness, or honesty). 
Different\nLMs can then be trained for different preferences through multi-objective RLHF\n(MORLHF) with different objective weightings. Yet, RLHF is unstable and\nresource-heavy, especially for MORLHF with diverse and usually conflicting\nobjectives. In this paper, we present Multi-Objective Direct Preference\nOptimization (MODPO), an RL-free algorithm that extends Direct Preference\nOptimization (DPO) for multiple alignment objectives. Essentially, MODPO folds\nLM learning directly into reward modeling, aligning LMs with the weighted sum\nof all principle-based rewards using pure cross-entropy loss. While\ntheoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO\nis practically more stable and computationally efficient, obviating value\nfunction modeling and online sample collection. Empirical results in safety\nalignment and long-form question answering confirm that MODPO matches or\noutperforms existing methods, consistently producing one of the most\ncompetitive LM fronts that cater to diverse preferences with 3 times fewer\ncomputations compared with MORLHF.\n","authors":["Zhanhui Zhou","Jie Liu","Chao Yang","Jing Shao","Yu Liu","Xiangyu Yue","Wanli Ouyang","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2310.03708v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.05726v2","updated":"2023-10-17T16:25:25Z","published":"2023-06-09T07:46:24Z","title":"Iteratively Refined Behavior Regularization for Offline Reinforcement\n Learning","summary":" One of the fundamental challenges for offline reinforcement learning (RL) is\nensuring robustness to the data distribution. Whether the data originates from a\nnear-optimal policy or not, we anticipate that an algorithm should demonstrate\nits ability to learn an effective control policy that seamlessly aligns with\nthe inherent distribution of offline data. Unfortunately, behavior\nregularization, a simple yet effective offline RL algorithm, tends to struggle\nin this regard. In this paper, we propose a new algorithm that substantially\nenhances behavior regularization based on conservative policy iteration. Our\nkey observation is that by iteratively refining the reference policy used for\nbehavior regularization, the conservative policy update guarantees gradual\nimprovement, while also implicitly avoiding querying out-of-sample actions to\nprevent catastrophic learning failures. We prove that in the tabular setting\nthis algorithm is capable of learning the optimal policy covered by the offline\ndataset, commonly referred to as the in-sample optimal policy. We then explore\nseveral implementation details of the algorithm when function approximations\nare applied. The resulting algorithm is easy to implement, requiring only a few\nlines of code modification to existing methods. Experimental results on the\nD4RL benchmark indicate that our method outperforms previous state-of-the-art\nbaselines in most tasks, clearly demonstrating its superiority over behavior\nregularization.\n","authors":["Xiaohan Hu","Yi Ma","Chenjun Xiao","Yan Zheng","Jianye Hao"],"pdf_url":"https://arxiv.org/pdf/2306.05726v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2107.02565v4","updated":"2023-10-17T16:22:04Z","published":"2021-07-06T12:08:44Z","title":"Prioritized training on points that are learnable, worth learning, and\n not yet learned (workshop version)","summary":" We introduce Goldilocks Selection, a technique for faster model training\nwhich selects a sequence of training points that are \"just right\". 
We propose\nan information-theoretic acquisition function -- the reducible validation loss\n-- and compute it with a small proxy model -- GoldiProx -- to efficiently\nchoose training points that maximize information about a validation set. We\nshow that the \"hard\" (e.g. high loss) points usually selected in the\noptimization literature are typically noisy, while the \"easy\" (e.g. low noise)\nsamples often prioritized for curriculum learning confer less information.\nFurther, points with uncertain labels, typically targeted by active learning,\ntend to be less relevant to the task. In contrast, Goldilocks Selection chooses\npoints that are \"just right\" and empirically outperforms the above approaches.\nMoreover, the selected sequence can transfer to other architectures;\npractitioners can share and reuse it without the need to recreate it.\n","authors":["Sören Mindermann","Muhammed Razzak","Winnie Xu","Andreas Kirsch","Mrinank Sharma","Adrien Morisot","Aidan N. Gomez","Sebastian Farquhar","Jan Brauner","Yarin Gal"],"pdf_url":"https://arxiv.org/pdf/2107.02565v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11377v1","updated":"2023-10-17T16:21:28Z","published":"2023-10-17T16:21:28Z","title":"Faster Algorithms for Generalized Mean Densest Subgraph Problem","summary":" The densest subgraph of a large graph usually refers to some subgraph with\nthe highest average degree, which has been extended to the family of $p$-means\ndense subgraph objectives by~\\citet{veldt2021generalized}. The $p$-mean densest\nsubgraph problem seeks a subgraph with the highest average $p$-th-power degree,\nwhereas the standard densest subgraph problem seeks a subgraph with simply the\nhighest average degree. It was shown that the standard peeling algorithm can\nperform arbitrarily poorly on the generalized objective when $p>1$ but uncertain\nwhen $0