Understanding non-relational and vector databases

In this post, we will try to gain an intuitive understanding of MongoDB, Elasticsearch, and Qdrant with examples.

We will start with a technical overview of each database, then compare MongoDB and Elasticsearch in more detail. Finally, we will build an intuition for all three databases with a librarian analogy.

MongoDB

MongoDB is a popular NoSQL document database that stores data in flexible, JSON-like documents with dynamic schemas. It is designed for scalability, high performance, and ease of use.

Key Concepts:

Document-oriented data model
Dynamic schema
Embedded data and arrays
Horizontal scaling through sharding
Replication for high availability

Example Usage:

import pymongo

# Connect to MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
collection = db["users"]

# Insert a document
user = {"name": "John", "age": 30, "email": "john@example.com"}
result = collection.insert_one(user)

# Query documents
for doc in collection.find({"age": {"$gt": 25}}):
    print(doc)

MongoDB excels in handling unstructured or semi-structured data, making it suitable for content management, mobile apps, IoT, and real-time analytics.
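To make the "dynamic schema" and "embedded data and arrays" ideas concrete without needing a running server, here is a minimal in-memory sketch. It is only a toy model of filter matching, not how MongoDB is implemented: the `matches` and `find` helpers and the sample documents are invented for illustration, and only equality and `$gt` are handled.

```python
from typing import Any

# Documents in one "collection" do not have to share the same fields.
users = [
    {"name": "John", "age": 30, "email": "john@example.com"},
    {"name": "Ana", "age": 22, "tags": ["admin", "beta"]},          # array field
    {"name": "Li", "address": {"city": "Pune", "zip": "411001"}},   # embedded document, no age
]


def matches(doc: dict[str, Any], query: dict[str, Any]) -> bool:
    """Return True if doc satisfies every condition (toy subset: equality and $gt)."""
    for field, condition in query.items():
        if field not in doc:
            return False  # in this sketch a missing field never matches
        value = doc[field]
        if isinstance(condition, dict) and "$gt" in condition:
            if not value > condition["$gt"]:
                return False
        elif value != condition:
            return False
    return True


def find(collection: list[dict[str, Any]], query: dict[str, Any]) -> list[dict[str, Any]]:
    """Scan the collection and keep the documents that match the query."""
    return [doc for doc in collection if matches(doc, query)]


older_than_25 = find(users, {"age": {"$gt": 25}})
print([doc["name"] for doc in older_than_25])
```

Documents that lack the queried field are simply skipped, which is why differently shaped documents can live side by side in the same collection.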

Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It provides near real-time search, powerful full-text capabilities, and scalable data analysis.

Key Concepts:

Inverted index for fast search
Distributed architecture with shards and replicas
Aggregations and analytics
RESTful API

Example Usage:

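from elasticsearch import Elasticsearch

# Connect to Elasticsearch
es = Elasticsearch("http://localhost:9200")

# Index a document (recent clients take `document=`; `body=` is deprecated)
doc = {"name": "Jane", "age": 28, "city": "New York"}
res = es.index(index="people", document=doc)

# Search is near real-time, so refresh to make the new document visible right away
es.indices.refresh(index="people")

# Search for documents whose city matches "New York"
results = es.search(index="people", query={"match": {"city": "New York"}})
for hit in results["hits"]["hits"]:
    print(hit["_source"])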

Elasticsearch shines in applications that require fast full-text search, log analysis, business analytics, security event monitoring, and geospatial data analysis.
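The inverted index listed under the key concepts is easy to sketch in plain Python. The version below is a simplified illustration, not Lucene's actual data structure: the sample documents, the regex tokenizer (a stand-in for a real analyzer), and the AND-only `search_all_terms` helper are all assumptions made for this example.

```python
import re
from collections import defaultdict

docs = {
    1: "Jane lives in New York",
    2: "John moved from Boston to New Jersey",
    3: "The York Minster is in England",
}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_inverted_index(docs)
print(sorted(index["york"]))                # ids of documents containing "york"
print(search_all_terms(index, "New York"))  # ids of documents containing both terms
```

A lookup touches only the postings for the query terms instead of scanning every document, which is the core reason this structure makes search fast. Note that a real `match` query combines terms with OR by default and ranks the hits by relevance; the AND logic here just keeps the sketch short.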

Qdrant

Qdrant is an open-source vector database optimized for storing, searching, and managing high-dimensional vector data, making it suitable for machine learning and AI applications.

Key Concepts:

Specialized for vector data
Fast vector similarity search
Real-time analytics and queries
Efficient storage and indexing of high-dimensional data

Example Usage:

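from qdrant_client import QdrantClient, models

# Connect to a Qdrant server (the `path` argument is for local on-disk mode, not a host)
client = QdrantClient(url="http://localhost:6333")

# Create the collection; the vector size must match the vectors inserted below.
# Newer client releases deprecate recreate_collection in favor of create_collection.
client.recreate_collection(
    collection_name="vectors",
    vectors_config=models.VectorParams(size=3, distance=models.Distance.COSINE),
)

# Insert vectors as points, each with its own id
vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
client.upsert(
    collection_name="vectors",
    points=[
        models.PointStruct(id=point_id, vector=vector)
        for point_id, vector in enumerate(vectors, start=1)
    ],
)

# Search for the two most similar vectors
# (newer client releases offer query_points as the replacement for search)
search_result = client.search(
    collection_name="vectors",
    query_vector=[1.0, 1.0, 1.0],
    limit=2,
)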

Qdrant is optimized for applications that require vector similarity search, such as recommendation systems, content-based image/video retrieval, fraud detection, and natural language processing.

Overall, MongoDB excels at handling unstructured data with flexible schemas, Elasticsearch provides powerful search and analytics capabilities, and Qdrant is specialized for efficient vector similarity search. The choice depends on the specific requirements around data structure, search needs, scalability, and performance.

Summary of technical similarities and differences between MongoDB and Elasticsearch

Similarities:

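NoSQL: Both Elasticsearch and MongoDB are NoSQL databases that store data as JSON-like documents with flexible schemas. Neither requires a schema up front, although Elasticsearch does keep a mapping of field types for each index, which it infers dynamically when none is defined.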
Distributed Architecture: They are designed to be distributed across multiple nodes/servers for scalability and high availability.
Horizontal Scaling: Both can scale horizontally by adding more nodes to the cluster as data grows.
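Source Available: Both publish their source code and offer free self-hosted editions. Their licenses have changed over time (MongoDB moved to the SSPL in 2018, and Elasticsearch changed its license in 2021), so check the current terms before treating either one as open source in the strict sense.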
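The horizontal scaling mentioned above rests on a simple idea: each document is routed to one shard based on a key. Here is a minimal sketch of hash-based routing. The shard count, the `shard_for` helper, and the user ids are hypothetical, and both systems use their own hash functions and placement logic (MongoDB also supports range-based sharding), so treat this as the core idea only.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a cryptographic digest is used to keep the routing repeatable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Distribute some documents (represented by their ids) across the shards.
shards: dict[int, list[str]] = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["user-1", "user-2", "user-3", "user-4", "user-5", "user-6"]:
    shards[shard_for(user_id)].append(user_id)

for shard_id, keys in shards.items():
    print(shard_id, keys)
```

Because the same key always maps to the same shard, a lookup by that key needs to contact only one node. The flip side is that changing the number of shards changes where keys land, which is why resharding is a heavyweight operation.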

Differences:

Primary Purpose:
- Elasticsearch is primarily a distributed search and analytics engine built on Apache Lucene. Its strength lies in full-text search, indexing, and data analysis capabilities [9] [11] [13].
- MongoDB is a general-purpose, document-oriented database designed for storing and retrieving data flexibly and efficiently [9] [11].
Data Storage:
- Elasticsearch uses an inverted index data structure optimized for fast full-text search and analytics [11] [13].
- MongoDB stores documents as BSON (a binary, JSON-like format) grouped into collections, and relies on B-tree indexes for lookups by field [9] [11].
Query Language:
- Elasticsearch provides a powerful query DSL (Domain Specific Language) for complex search queries and aggregations [9] [11].
- MongoDB uses a rich query language with support for ad-hoc queries, indexing, and aggregations [9] [11].
Use Cases:
- Elasticsearch excels in applications that require full-text search, log analysis, security event monitoring, and real-time analytics [9] [11] [13].
- MongoDB is well-suited for applications that require flexible data modeling, high scalability, and real-time data access, such as content management, mobile apps, and IoT [9] [11].
Data Manipulation:
- Elasticsearch is optimized for read operations (search and analytics) rather than write operations [11] [13].
- MongoDB supports efficient read and write operations, making it suitable for both data storage and retrieval [9] [11].

In summary, while both Elasticsearch and MongoDB are NoSQL databases with distributed architectures, Elasticsearch specializes in search and analytics, while MongoDB focuses on flexible data storage and retrieval. The choice between them depends on the specific requirements of your application, such as whether you prioritize search capabilities or data modeling and storage [9] [11] [13].

Librarian analogy to solidify distinctions between all three platforms

Think of each database (MongoDB, Elasticsearch, Qdrant) as a digital library, each with a uniquely skilled librarian. Let us see how each of the three librarians organizes their library.

All three databases can store entire books, beyond just titles or specific fields. The difference lies in how they are optimized for searching and retrieving data.

Storing and Retrieving Data

In MongoDB, you would store the entire book content as a single document, along with other metadata like title, author, etc. You can retrieve the entire book document based on its unique identifier or other fields.

Example book document in MongoDB:

{
  "_id": ObjectId("609f5a5c8f7f9c2b9c6d5e4f"),
  "title": "To Kill a Mockingbird",
  "author": "Harper Lee",
  "content": "The entire book content goes here..."
}

In Elasticsearch, you would also store the entire book content, typically broken down into smaller fields like chapters or pages, along with metadata fields.

Example book document in Elasticsearch:

{
  "_index": "books",
  "_id": "1",
  "_source": {
    "title": "To Kill a Mockingbird",
    "author": "Harper Lee",
    "chapters": [
      {
        "title": "Chapter 1",
        "content": "The content of chapter 1 goes here..."
      },
      {
        "title": "Chapter 2",
        "content": "The content of chapter 2 goes here..."
      }
    ]
  }
}

Searching for Specific Content

If you want to search for a specific sentence within the book content, both Elasticsearch and MongoDB can do that, but with different approaches and performance characteristics.

In MongoDB, you can use the $regex operator to perform a regular expression search on the book content:

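import re

sentence = "Shoot all the bluejays you want, if you can hit 'em, but remember it's a sin to kill a mockingbird."
# Escape regex metacharacters (such as ".") so the sentence is matched literally
query = {"content": {"$regex": re.escape(sentence)}}
book = db.books.find_one(query)

This query will search for the exact sentence within the content field of the book documents. If a match is found, it will return the first entire book document that contains the sentence. Because an unanchored regular expression cannot make efficient use of an index, MongoDB has to scan the content of every book to answer it.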

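In Elasticsearch, you can use the match_phrase query to search for the specific sentence. Because the example document above stores its text per chapter, the query targets the chapters.content field:

query = {
    "match_phrase": {
        "chapters.content": "Shoot all the bluejays you want, if you can hit 'em, but remember it's a sin to kill a mockingbird."
    }
}
results = es.search(index="books", query=query)

Elasticsearch will look for documents in which the analyzed terms of the sentence appear in the same order, so with the default analyzer differences in case and punctuation are ignored. By default it returns the whole matching book documents, ranked by relevance. To find out which chapter contains the sentence, you can request highlighting, or map chapters as a nested field and ask for inner_hits.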

The key difference is that Elasticsearch is optimized for full-text search and can perform this type of search more efficiently than MongoDB, especially on large datasets. It uses an inverted index to quickly find documents that contain the specified terms or phrases.
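To see why phrase search over an inverted index can be efficient, it helps to know that the index can also record where each term occurs. The sketch below builds such a positional index in plain Python. It is a simplified illustration rather than Lucene's implementation, and the two sample texts, the tokenizer, and the `phrase_search` helper are made up for this example.

```python
import re
from collections import defaultdict

books = {
    "mockingbird": "Remember it's a sin to kill a mockingbird.",
    "birdwatching": "A mockingbird can kill time by singing; it is no sin.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase and drop punctuation (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_positional_index(documents: dict[str, str]):
    """Map term -> {doc_id: [positions at which the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase: str) -> set[str]:
    """Return ids of documents containing the phrase terms consecutively, in order."""
    terms = tokenize(phrase)
    if not terms:
        return set()
    # Step 1: only documents containing every term can match.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Step 2: check that term i occurs at position start + i.
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[term][doc_id] for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


index = build_positional_index(books)
print(phrase_search(index, "SIN to KILL a mockingbird"))  # → {'mockingbird'}
print(phrase_search(index, "kill a sin"))                 # → set()
```

The first query matches despite the different casing because both the text and the query pass through the same tokenizer. The second returns nothing even though both books contain all three words, because the words never appear consecutively in that order. A regex scan, by contrast, has to read the full text of every book.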

Qdrant: The Vector Similarity Specialist

Qdrant is like a librarian with a special ability to find similar books based on their content. Before a book is added to the library, an embedding model converts the book’s content into a mathematical representation called a vector, and Qdrant stores and indexes that vector.

When you want to find books similar to a specific book, you give Qdrant the book’s vector. Qdrant then looks for the stored vectors closest to it, using an approximate nearest-neighbor index (HNSW) instead of comparing against every book one by one, and returns a list of books that are most similar in content. This is called vector similarity search, and it’s useful when you want to find books that are conceptually similar, even if they don’t contain the exact same words [23] [29].
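The idea of "closest vectors" can be sketched in a few lines of plain Python using cosine similarity. Everything here is an assumption for illustration: the three-dimensional vectors are hand-made rather than produced by an embedding model, and the search is a brute-force comparison, whereas a vector database would normally use an approximate nearest-neighbor index such as HNSW to avoid scanning every vector.

```python
import math

# Hand-made 3-dimensional "embeddings" (invented for illustration); real ones
# come from a model and usually have hundreds of dimensions.
book_vectors = {
    "To Kill a Mockingbird": [0.9, 0.1, 0.2],
    "Go Set a Watchman": [0.8, 0.2, 0.3],
    "A Brief History of Time": [0.1, 0.9, 0.4],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vector, vectors, limit=2, exclude=None):
    """Score every stored vector against the query and return the top `limit`."""
    scored = [
        (title, cosine_similarity(query_vector, vector))
        for title, vector in vectors.items()
        if title != exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


query_title = "To Kill a Mockingbird"
results = most_similar(book_vectors[query_title], book_vectors, limit=2, exclude=query_title)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

With these made-up vectors, the other novel ranks well above the physics book, even though no words were compared at all. That is the sense in which the similarity is "conceptual": it depends entirely on how the embedding model placed the books in vector space.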

In summary:

Elasticsearch is like a powerful search engine that excels at fast, full-text search based on the content of the books.
MongoDB is like a flexible document store that can handle books of different structures and allows for complex queries.
Qdrant is like a vector similarity specialist that finds books with similar content based on their mathematical representations.

Each librarian (database) has its strengths, and the best choice depends on specific needs around fast text search, flexible data storage, or finding similar items based on their content.

At this point we had a basic yet conceptually important question, so we put it to Claude. To our pleasant surprise, Claude gave an interesting and nuanced response. Let’s check it out.

Can vector databases be considered the only “truly” schema-less databases? Or maybe there is some structure built into the embedding, but post embedding and storage, the retrieval is entirely non-relational. As a side note, when you say “query by example” for NoSQL, is the motivation to train a machine learning model? If so, wouldn’t it make NoSQL databases essentially vector databases, since ML needs vector embeddings?

You raise some great points about the relationship between vector databases and the concept of being “schemaless”. Let me share my perspective on this.

While vector databases are often considered schemaless because they don’t enforce a rigid, predefined schema on the data, I wouldn’t say they are the only “truly” schemaless databases. Other NoSQL databases like document stores (e.g. MongoDB) and key-value stores (e.g. Redis) also provide a high degree of flexibility in terms of the structure of the data they store.

The key difference is that in vector databases, the primary representation of the data is the vector embedding itself. And you’re right that there is some inherent structure in these embeddings based on how they were generated (e.g. which language model was used). But the database itself doesn’t impose constraints on what those embeddings represent or how they relate to each other. The relationships and similarities emerge from the vector representations themselves.

So in that sense, the retrieval in vector databases is indeed more “non-relational” compared to other NoSQL databases where you still query based on keys, fields, or document structure to some degree. Vector search is entirely based on the mathematical properties of the embeddings.

Regarding your question about “query by example” in NoSQL databases - this can refer to a few different things:

In document databases, it often means providing an example document to match against, and the database returns documents that have a similar structure/fields. This is more about structural similarity than semantic similarity.
In some NoSQL databases that support machine learning, you can provide an example data point and ask the database to return similar data points based on a trained ML model. This is closer to what vector databases do.
And as you noted, if the NoSQL database is actually storing and querying vector embeddings generated by an ML model, then it essentially becomes a vector database, even if it didn’t start out that way.

So in summary, while I wouldn’t consider vector databases the only truly schemaless option, they do take the concept of non-relational, similarity-based retrieval to the extreme by operating entirely in the vector space. And you’re right that as more NoSQL databases incorporate machine learning and vector embeddings, the line between them and purpose-built vector databases starts to blur.

The key is understanding the core retrieval and similarity mechanisms of the database - whether it’s based on keys, document structure, vector math, or a combination. Each has its strengths and use cases.

References

Assisted by claude-3-opus on perplexity.ai

Written on April 3, 2024