I am a second-year PhD student at Cornell Tech, advised by Prof. Yoav Artzi. My current research focuses on the efficient use of knowledge by LLMs.
Before starting my PhD, I spent several years as an applied computer vision researcher and data scientist.
I hold an MSc in Computer Science from the Technion, where I was advised by Ran El-Yaniv. Prior to that, I completed my BSc in Software Engineering, also at the Technion.
Simple Context Compression: Mean-Pooling and Multi-Ratio Training
A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly, though, the trade-offs across architectures and training regimes are more nuanced, illustrating the complex landscape of compression methods.
@article{feldman2025simplecontextcompressionmeanpooling,
  title   = {Simple Context Compression: Mean-Pooling and Multi-Ratio Training},
  author  = {Feldman, Yair and Artzi, Yoav},
  year    = {2025},
  journal = {Preprint (Under Review)},
}
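To make the mean-pooling idea concrete, here is a minimal sketch (not the paper's implementation) of the compression step: consecutive windows of context-token embeddings are averaged into a shorter sequence of soft embeddings that a decoder can attend to in place of the full retrieved context. The function name, the fixed 4x ratio, and the assumption that the sequence length divides evenly are all illustrative choices.

import torch

def mean_pool_compress(hidden_states: torch.Tensor, ratio: int) -> torch.Tensor:
    """Mean-pool non-overlapping windows of `ratio` token embeddings.

    hidden_states: (batch, seq_len, dim); seq_len is assumed to be a
    multiple of `ratio` to keep the sketch short.
    Returns (batch, seq_len // ratio, dim) soft context embeddings.
    """
    batch, seq_len, dim = hidden_states.shape
    assert seq_len % ratio == 0, "pad or truncate the context first"
    grouped = hidden_states.view(batch, seq_len // ratio, ratio, dim)
    return grouped.mean(dim=2)

# Example: 4x compression of 128 context-token embeddings into 32 soft embeddings.
ctx = torch.randn(2, 128, 768)              # e.g., compressor hidden states
soft_ctx = mean_pool_compress(ctx, ratio=4)
print(soft_ctx.shape)                       # torch.Size([2, 32, 768])

In the multi-ratio setting mentioned in the abstract, the same compressor would be trained with several values of ratio rather than a single fixed one.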
Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method adds minimal overhead to the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
@inproceedings{monea2025breadcrumbsreasoningmemoryefficientreasoning,
  title     = {Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons},
  author    = {Monea, Giovanni and Feldman, Yair and Padmanabhan, Shankar and Brantley, Kianté and Artzi, Yoav},
  year      = {2025},
  booktitle = {Workshop on Efficient Reasoning @ NeurIPS},
}
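The sketch below illustrates only the cache bookkeeping implied by the abstract: every fixed window of generated cache entries is replaced by a single summary entry and the raw entries are evicted. The mean-based summary, the class name, and the window size are placeholders; in the paper the summary is produced by a learned, special-purpose compression token trained with the joint distillation and RL objective.

import torch

class BeaconKVCache:
    """Toy schedule for periodic KV-cache compression (illustration only).

    Every `window` generated entries, a single summary ("beacon") entry is
    kept in place of the window, and the window's raw key/value entries are
    evicted. Here the beacon is just the mean of the window's entries.
    """

    def __init__(self, window: int = 8):
        self.window = window
        self.keys: list[torch.Tensor] = []     # one (dim,) tensor per cached entry
        self.values: list[torch.Tensor] = []
        self.pending = 0                        # entries generated since the last beacon

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        self.pending += 1
        if self.pending == self.window:
            self._compress_window()

    def _compress_window(self) -> None:
        # Summarize the last `window` entries into one beacon entry, then evict them.
        beacon_k = torch.stack(self.keys[-self.window:]).mean(dim=0)
        beacon_v = torch.stack(self.values[-self.window:]).mean(dim=0)
        del self.keys[-self.window:], self.values[-self.window:]
        self.keys.append(beacon_k)
        self.values.append(beacon_v)
        self.pending = 0

# After 32 generated tokens with window=8, the cache holds 4 beacon entries
# instead of 32 raw ones.
cache = BeaconKVCache(window=8)
for _ in range(32):
    cache.append(torch.randn(64), torch.randn(64))
print(len(cache.keys))  # 4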
Multi-Hop Paragraph Retrieval for Open-Domain Question Answering
This paper is concerned with the task of multi-hop open-domain Question Answering (QA). This task is particularly challenging since it requires the simultaneous performance of textual reasoning and efficient searching. We present a method for retrieving multiple supporting paragraphs, nested amidst a large knowledge base, which contain the necessary evidence to answer a given question. Our method iteratively retrieves supporting paragraphs by forming a joint vector representation of both a question and a paragraph. The retrieval is performed by considering contextualized sentence-level representations of the paragraphs in the knowledge source. Our method achieves state-of-the-art performance over two well-known datasets, SQuAD-Open and HotpotQA, which serve as our single- and multi-hop open-domain QA benchmarks, respectively.
@inproceedings{feldman-el-yaniv-2019-multi,
  title     = {Multi-Hop Paragraph Retrieval for Open-Domain Question Answering},
  author    = {Feldman, Yair and El-Yaniv, Ran},
  booktitle = {ACL},
  year      = {2019},
}
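The following toy sketch mirrors the iterative retrieval loop described in the abstract: a query vector scores paragraphs by inner product, and the query is then reformulated with the top retrieved paragraph before the next hop. The reformulate function, the additive reformulation in the example, and the random vectors are stand-ins for the learned joint question-paragraph representation and the contextualized sentence-level encodings used in the paper.

import torch

def iterative_retrieve(question_vec, paragraph_vecs, reformulate, hops=2, top_k=2):
    """Toy sketch of iterative (multi-hop) paragraph retrieval.

    question_vec:   (dim,) query representation.
    paragraph_vecs: (num_paragraphs, dim) precomputed paragraph representations.
    reformulate:    function (query_vec, retrieved_vec) -> new query_vec,
                    standing in for the learned joint encoder.
    Returns the paragraph indices retrieved at each hop.
    """
    query = question_vec
    retrieved_per_hop = []
    for _ in range(hops):
        scores = paragraph_vecs @ query                  # inner-product search
        top = torch.topk(scores, k=top_k).indices
        retrieved_per_hop.append(top.tolist())
        # Reformulate the query with the best paragraph for the next hop.
        query = reformulate(query, paragraph_vecs[top[0]])
    return retrieved_per_hop

# Example with random vectors and a simple additive reformulation.
dim, n = 32, 100
q = torch.randn(dim)
P = torch.randn(n, dim)
hops = iterative_retrieve(q, P, reformulate=lambda q, p: q + p)
print(hops)  # e.g., [[17, 52], [52, 3]]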