Engineer’s Codex is a publication about real-world software engineering.
Cursor, the popular AI IDE that recently announced they hit $300M ARR, uses Merkle trees to index code fast. This post goes over exactly how.
Before diving into Cursor's implementation, let's first understand what a Merkle tree is.
Merkle Trees Explained Simply
A Merkle tree is a tree structure in which every "leaf" node is labeled with the cryptographic hash of a data block, and every non-leaf node is labeled with the cryptographic hash of the labels of its child nodes. This creates a hierarchical structure where changes at any level can be efficiently detected by comparing hash values.
Think of them as a fingerprinting system for data:
Each piece of data (like a file) gets its own unique fingerprint (hash)
Pairs of fingerprints are combined and given a new fingerprint
This process continues until you have just one master fingerprint (the root hash)
The root hash summarizes all data contained in the individual pieces, serving as a cryptographic commitment to the entire dataset. The beauty of this approach is that if any single piece of data changes, it will change all the fingerprints above it, ultimately changing the root hash.
Dev Starter Packs (Sponsored)
Everything you need to build a profitable side project. Except the excuses.
DevStarterPacks gives you everything you need to launch stunning, production-ready mobile and web apps—auth, payments, analytics, and more—without the setup headache.
Save hundreds of hours and get lifetime access to tools trusted by devs who have launched fast and made thousands of MRR.
How Cursor Uses Merkle Trees for Codebase Indexing
Cursor uses Merkle trees as a core component of its codebase indexing feature. According to a post by Cursor's founder and the security documentation, here's how it works:
Step 1: Code Chunking and Processing
Cursor first chunks your codebase files locally, splitting code into semantically meaningful pieces before any processing occurs.
Step 2: Merkle Tree Construction and Synchronization
When codebase indexing is enabled, Cursor scans the folder opened in the editor and computes a Merkle tree of hashes of all valid files. This Merkle tree is then synchronized with Cursor's server, as detailed in Cursor's security documentation.
Step 3: Embedding Generation
After the chunks are sent to Cursor's server, embeddings are created using either OpenAI's embedding API or a custom embedding model (I couldn’t verify this). These vector representations capture the semantic meaning of the code chunks.
Step 4: Storage and Indexing
The embeddings, along with metadata like start/end line numbers and file paths, are stored in a remote vector database (Turbopuffer). To maintain privacy while still enabling path-based filtering, Cursor stores an obfuscated relative file path with each vector. Importantly, according to Cursor's founder, "None of your code is stored in our databases. It's gone after the life of the request."
Step 5: Periodic Updates Using Merkle Trees
Every 10 minutes, Cursor checks for hash mismatches, using the Merkle tree to identify which files have changed. Only the changed files need to be uploaded, significantly reducing bandwidth usage, as explained in Cursor's security documentation. This is where the Merkle tree structure provides its greatest value—enabling efficient incremental updates.
Code Chunking Strategies
The effectiveness of the codebase indexing largely depends on how code is chunked. While my previous explanation didn't go into detail about chunking methods, this blog post about building a Cursor-like codebase feature provides some insights:
While simple approaches split code by characters, words, or lines, they often miss semantic boundaries—resulting in degraded embedding quality.
You can split code based on a fixed token count, but this can cut off code blocks like functions or classes mid-way.
A more effective approach is to use an intelligent splitter that understands code structure, such as recursive text splitters that use high-level delimiters (e.g., class and function definitions) to split at appropriate semantic boundaries.
An even more elegant solution is to split the code based on its Abstract Syntax Tree (AST) structure. By traversing the AST depth-first, it splits code into sub-trees that fit within token limits. To avoid creating too many small chunks, sibling nodes are merged into larger chunks as long as they stay under the token limit. Tools like tree-sitter can be used for this AST parsing, supporting a wide range of programming languages.
For a crash course on tokens, read this.
Retrieval-Augmented Generation (RAG) for Code
Cursor’s codebase indexing is essentially a RAG system tailored for code. When you use @Codebase or ⌘ Enter to ask about your codebase, Cursor retrieves relevant code chunks from the vector database to provide context for large language model (LLM) calls. This approach allows the AI to "understand" your entire codebase without needing to fit it all within the context window of the underlying language model.
Why Cursor Uses Merkle Trees
Many of these details are security-related, and thus can be found in Cursor’s security documentation.
1. Efficient Incremental Updates
By using a Merkle tree, Cursor can quickly identify exactly which files have changed since the last synchronization. Instead of re-uploading the entire codebase, it only needs to upload the specific files that have been modified. This is important for large codebases where re-indexing everything would be too expensive in terms of bandwidth and processing time.
2. Data Integrity Verification
The Merkle tree structure allows Cursor to efficiently verify that the files being indexed match what's stored on the server. The hierarchical hash structure makes it easy to detect any inconsistencies or corrupted data during transfer.
3. Optimized Caching
Cursor stores embeddings in a cache indexed by the hash of the chunk, ensuring that indexing the same codebase a second time is much faster. This is great for teams where multiple developers might be working with the same codebase.
4. Privacy-Preserving Indexing
To protect sensitive information in file paths, Cursor implements path obfuscation by splitting the path by '/' and '.' characters and encrypting each segment with a secret key stored on the client. While this still reveals some information about directory hierarchy, it hides most sensitive details.
5. Git History Integration
When codebase indexing is enabled in a Git repository, Cursor also indexes the Git history. It stores commit SHAs, parent information, and obfuscated file names. To enable sharing the data structure for users in the same Git repo and on the same team, the secret key for obfuscating file names is derived from hashes of recent commit contents.
Embedding Models and Considerations
The choice of embedding model significantly impacts the quality of code search and understanding. While some systems use open-source models like all-MiniLM-L6-v2, Cursor likely uses either OpenAI's embedding models or custom embedding models specifically tuned for code. For specialized code embeddings, models like Microsoft's unixcoder-base or Voyage AI's voyage-code-2 are good for code-specific semantic understanding.
The embedding challenge is made more complex because embedding models have token limits. OpenAI's text-embedding-3-small model, for example, has a token limit of 8192. Effective chunking helps stay within token limits while preserving semantic meaning.
The Handshake Process
A key aspect of Cursor's Merkle tree implementation is the handshake process that occurs during synchronization. Logs from the Cursor application reveal that when initializing codebase indexing, Cursor creates a "merkle client" and performs a "startup handshake" with the server. This handshake involves sending the root hash of the locally computed Merkle tree to the server, as seen in Issue #2209 on GitHub and Issue #981 on GitHub.
The handshake process allows the server to determine which parts of the codebase need to be synced. Based on the handshake logs, we can see that Cursor computes the initial hash of the codebase and sends it to the server for verification, as documented in Issue #2209 on GitHub.
Technical Implementation Challenges
While the Merkle tree approach offers many advantages, it's not without implementation challenges. Cursor's indexing feature often experiences heavy load, causing many requests to fail. This can result in files needing to be uploaded several times before they get fully indexed. Users might notice higher than expected network traffic to 'repo42.cursor.sh' as a result of these retry mechanisms, as mentioned in Cursor's security documentation.
Another challenge relates to embedding security. Academic research has shown that reversing embeddings is possible in some cases. While current attacks typically rely on having access to the embedding model and working with short strings, there is a potential risk that an adversary who gains access to Cursor's vector database could extract information about indexed codebases from the stored embeddings.