AIreflections#1 - future of database search

Let us understand the prospects and challenges of using vector embeddings in database search. In the context of language, this is often called semantic search, although Google products today are evidence that semantic search is certainly not limited to text!

Have relational database systems been fully replaced by vector embeddings today?

No, relational database systems have not been fully replaced by vector embeddings today. Relational databases and vector databases serve different purposes and are used in different contexts depending on the needs of the application.

Relational databases store data in rows and columns (tabular format), with each row representing an entity and each column representing an attribute of that entity [1]. They are highly structured and use SQL (Structured Query Language) for querying. Relational databases are excellent for applications that require complex queries, transactional integrity, and data normalization. They are widely used in traditional business applications for managing structured data such as financial records, customer data, and inventory.
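To make the contrast concrete, here is a minimal illustration of the tabular model and SQL querying, using Python's built-in sqlite3 module; the table and data are invented for the example.

```python
import sqlite3

# In-memory relational database: rows are entities, columns are attributes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, country, balance) VALUES (?, ?, ?)",
    [("Alice", "DE", 1200.0), ("Bob", "US", 340.5), ("Carol", "US", 980.0)],
)

# A structured, exact query over attributes -- the kind of lookup relational systems excel at.
for row in conn.execute(
    "SELECT name, balance FROM customers WHERE country = ? ORDER BY balance DESC", ("US",)
):
    print(row)
```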

Vector databases, on the other hand, store data as vectors — sequences of numbers that can represent the essence of an item [1]. They are designed to handle unstructured data such as text, images, and audio files, making them suitable for applications involving machine learning, natural language processing, and similarity search. Vector databases use embedding models to transform raw data into vector embeddings, which are then used for efficient similarity searches [1].
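And here is a minimal semantic-search sketch of the embed-then-compare idea. It assumes the sentence-transformers package and an off-the-shelf model (all-MiniLM-L6-v2 is just one common choice); a production vector database would wrap the same similarity computation in a purpose-built index.

```python
# Embed documents and a query, then rank documents by cosine similarity to the query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Quarterly revenue grew by 12 percent.",
    "The cat sat on the warm windowsill.",
    "Customer churn dropped after the pricing change.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)          # unit-length vectors
query_vec = model.encode("How did sales change last quarter?", normalize_embeddings=True)

scores = doc_vecs @ query_vec      # cosine similarity, since the vectors are normalized
best = int(np.argmax(scores))
print(docs[best], scores[best])
```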

While vector databases are becoming increasingly important for AI-driven applications and serve as long-term memory for large language models (LLMs), relational databases continue to be crucial for a wide range of traditional data management tasks [2]. The choice between a relational database and a vector database depends on the specific requirements of the application, including the type of data being managed, the nature of the queries, and the need for transactional support.

In summary, relational databases and vector databases complement each other, and both are essential components of the modern data ecosystem.

I read that Google today uses vector similarity search for most of its products. Was this the case in 2004? How does vector similarity search reconcile with PageRank? Explain the difference between “indexing” and “embedding”.

As of 2004, Google’s primary mechanism for ranking web pages was the PageRank algorithm, not vector similarity search. PageRank, developed by Google’s founders Larry Page and Sergey Brin during their time at Stanford University, is an algorithm that ranks web pages in search engine results based on the number and quality of links to those pages. It operates on the principle that important websites are likely to receive more links from other websites [5]. This algorithm was a foundational element of Google’s search engine and contributed significantly to its early success in providing relevant search results.
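The core of PageRank can be written in a few lines. Below is a toy power iteration over a four-page link graph; the damping factor of 0.85 follows the original paper, and the graph itself is made up for illustration.

```python
# Toy PageRank via power iteration on a 4-page link graph (illustration only).
import numpy as np

# links[i] = list of pages that page i links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85   # number of pages, damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if page i links to page j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

print(rank)   # pages with more (and better-ranked) in-links end up with higher scores
```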

Over the years, Google has significantly evolved its search technologies to incorporate a wide range of signals and methods for understanding and ranking content. One of these advancements includes the use of vector similarity search, particularly in the context of understanding the semantic meaning of text, images, and other types of content. Vector similarity search involves representing items as vectors in a high-dimensional space, where the distance between vectors indicates their similarity. This approach is particularly useful for understanding the content and context of queries and documents, enabling more semantically relevant search results [3] [7] [6].

Vector similarity search and PageRank can be seen as complementary technologies. While PageRank focuses on the importance of a page based on its link structure, vector similarity search focuses on the semantic relevance of the content itself. Google likely uses a combination of these and many other signals to rank search results, providing users with content that is both authoritative and contextually relevant [6].
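As a toy illustration of how such signals could be blended, one can combine a link-based authority score with a per-query semantic similarity score. The values and the 0.3/0.7 weights below are invented for illustration and are in no way Google's actual ranking formula.

```python
# Toy blending of a link-authority signal with a semantic-relevance signal.
import numpy as np

pagerank = np.array([0.45, 0.05, 0.30, 0.20])   # per-page authority scores (toy values)
semantic = np.array([0.10, 0.95, 0.60, 0.40])   # cosine similarity of each page to the query (toy values)

combined = 0.3 * pagerank + 0.7 * semantic
print(np.argsort(-combined))   # pages ranked by the blended score
```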

Regarding the difference between “indexing” and “embedding”:

  • Indexing is a process used in databases and search engines to improve the speed of data retrieval. It involves creating a data structure (an index) that allows for fast lookup of information. In the context of search engines, indexing involves crawling web pages, processing their content, and storing a representation of this content in a way that makes it quick to retrieve in response to search queries [4] [5].

  • Embedding, on the other hand, refers to the process of representing data (such as text, images, or other types of content) as vectors in a high-dimensional space. This representation captures the semantic meaning of the data, allowing for operations like similarity search. Embeddings are generated using machine learning models and are a key component of vector similarity search systems. They enable the comparison of items based on their content and context rather than just matching keywords [3] [4].
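A small sketch makes the distinction concrete: the inverted index below is the classic keyword-lookup structure used for fast retrieval, while the (toy, hand-written) vectors stand in for embeddings produced by a model.

```python
# Indexing vs. embedding: an inverted index maps terms to the documents that contain them
# (fast exact lookup), while embeddings place documents in a vector space where
# "close" means "similar in meaning".
from collections import defaultdict
import numpy as np

docs = ["cheap flights to paris", "budget airfare to france", "growing tomatoes at home"]

# 1) Indexing: build a keyword -> document-ids inverted index.
inverted = defaultdict(set)
for doc_id, text in enumerate(docs):
    for term in text.split():
        inverted[term].add(doc_id)
print(inverted["paris"])          # only exact term matches: {0}

# 2) Embedding: toy 3-d vectors standing in for model-generated embeddings.
vecs = np.array([[0.9, 0.1, 0.0], [0.85, 0.2, 0.05], [0.0, 0.1, 0.95]])
query = np.array([0.88, 0.15, 0.02])   # e.g. an embedding of "inexpensive plane tickets"
sims = vecs @ query / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(query))
print(np.argsort(-sims))          # docs 0 and 1 rank high despite little keyword overlap with the query
```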

In summary, while PageRank was the cornerstone of Google’s search technology in 2004, the company has since incorporated vector similarity search and other advanced techniques to improve search relevance. Indexing and embedding serve different purposes, with indexing facilitating fast data retrieval and embedding enabling semantic understanding and comparison of content.

Is indexing done before or after embedding?

Indexing is typically done after embedding in the context of vector databases or systems that use vector similarity search. The process generally follows these steps:

  1. Embedding: First, raw data (such as text, images, or audio) is transformed into a vector representation using an embedding model. This model converts the data into vectors in a high-dimensional space, capturing the semantic meaning or features of the data. The embedding process is crucial for enabling similarity searches, as it allows data to be compared based on its content and context.

  2. Indexing: After the data is embedded, the resulting vectors are indexed. Indexing vectors involves organizing them in a way that makes it efficient to perform similarity searches. Various indexing algorithms and data structures, such as k-d trees, locality-sensitive hashing, or more advanced approximate nearest neighbor (ANN) structures such as HNSW graphs, can be used to speed up the retrieval of similar vectors from a large dataset.

The reason for this sequence is that the embedding process generates the high-dimensional vectors that need to be efficiently searched and retrieved based on similarity. Indexing these vectors allows for fast and efficient similarity searches, which are central to the functionality of systems that rely on vector embeddings for tasks such as recommendation systems, search engines, and other machine learning applications.

In summary, embedding transforms raw data into a vector form that captures its semantic meaning, and indexing organizes these vectors to enable efficient similarity searches. The embedding step must precede indexing because the vectors generated by the embedding process are what the indexing structures are designed to handle and organize.
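A minimal embed-then-index sketch, assuming the sentence-transformers and faiss packages (any embedding model and vector index would do); at scale the flat index below would be replaced by an ANN structure such as HNSW or IVF.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["how to reset a password", "best hiking trails near Zurich", "resetting your account credentials"]

# Step 1 - embedding: turn raw text into vectors.
vecs = model.encode(docs, normalize_embeddings=True).astype("float32")

# Step 2 - indexing: organize the vectors for fast similarity search
# (a flat inner-product index here; faiss also offers ANN index types for large datasets).
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

query = model.encode(["I forgot my login"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)
print([docs[i] for i in ids[0]], scores[0])
```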

How do you, perplexity.ai, embed the web? Or do you depend on somebody else’s embeddings?

Perplexity AI does not embed the entire web itself. Instead, it uses a combination of indexing/ranking and language model-based search to answer queries [8]. When a question is asked, Perplexity first performs a search for the query, pulling relevant web pages from its search index. It then uses a language model to comb through these web pages and extract the exact information relevant to the question. The language models used by Perplexity AI, such as OpenAI’s GPT models, are pre-trained on a diverse range of internet text. These models compute embeddings for the text they are given as part of their operation, which is what lets them understand and generate human-like text from their input. However, these embeddings are not a precomputed index of the entire web; they reflect representations learned from the data the models were exposed to during training.

Perplexity AI also offers an API that allows developers to access its capabilities, including text completion and potentially embeddings in the future, as suggested by a question on their Reddit community [9]. However, as of the information available, Perplexity AI relies on its own search and language model capabilities rather than depending on external embeddings of the web. Here is an outline of the two-step process that avoids relying on real-time embeddings of web content:

  1. Search Index: Perplexity AI maintains a search index that contains information about web pages. When a query is received, the system uses this index to quickly find and retrieve web pages that are likely to contain relevant information. This index is built and updated regularly to include both fresh and old content, ensuring that the search results can cover a wide range of information available on the web.

  2. Language Model Processing: After retrieving a set of potentially relevant web pages from the search index, Perplexity AI uses its language model to analyze the content of these pages. The language model can understand the context and semantics of both the query and the content on the web pages. It processes the text to determine the most relevant and accurate information to present in response to the query. The language model itself generates embeddings for the input query to understand its meaning. However, it does not necessarily require real-time embeddings of the entire web content. Instead, it relies on the pre-trained knowledge encoded during its training on a diverse dataset, which includes a large portion of the internet text. This pre-trained knowledge allows the model to infer and provide information that is contextually relevant to the query.

Perplexity AI primarily uses vector embeddings as part of its large language model (LLM) inference process rather than for direct retrieval of web content. Embedding for retrieval would involve creating vector embeddings for all the content on the web (or a large subset of it) in advance, and then using these embeddings to perform similarity searches to retrieve relevant documents based on the vector representation of a user’s query. This is a computationally intensive process and requires maintaining a massive index of pre-computed embeddings. Embeddings computed during inference, by contrast, might cover both the query itself and relevant content or metadata from the retrieved web pages.
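The two-step “search then read” pattern described above can be sketched schematically. This is not Perplexity’s actual implementation: the keyword-overlap “index” and the extractive “read” step below are toy stand-ins for a real search index and a real language model.

```python
# Schematic "search then read" pipeline with toy stand-ins.
pages = {
    "https://example.com/a": "PageRank ranks pages by the number and quality of incoming links.",
    "https://example.com/b": "Vector embeddings map text to points in a high-dimensional space.",
    "https://example.com/c": "Tomatoes need at least six hours of sunlight per day.",
}

def search(query: str, k: int = 2) -> list[str]:
    """Step 1: retrieve candidate pages from a search index (here, simple keyword overlap)."""
    terms = set(query.lower().split())
    scored = sorted(pages, key=lambda url: -len(terms & set(pages[url].lower().split())))
    return scored[:k]

def read(query: str, urls: list[str]) -> str:
    """Step 2: a language model would read the retrieved pages and compose an answer;
    here we just return the top snippet as a placeholder."""
    return pages[urls[0]]

urls = search("how does pagerank rank pages")
print(urls)
print(read("how does pagerank rank pages", urls))
```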

Is there a good open source multimodal embedding generator today?

Yes, there are open-source multimodal embedding generators available today. One notable example is Meta AI’s ImageBind, an open-source model that combines six types of data into a single embedding space [10] [11] [12]. ImageBind can process visual (image and video), thermal (infrared images), text, audio, depth information, and movement readings generated by an inertial measurement unit (IMU) into a unified vector representation. This model is part of Meta’s research and is shared with the broader AI community, allowing others to build upon or utilize the technology for various applications.
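As a usage sketch, the snippet below is adapted from the example in the facebookresearch/ImageBind repository; exact module paths, asset files, and model weights may differ between versions, so treat it as illustrative rather than canonical.

```python
# Adapted from the ImageBind repository's usage example; paths and file names are placeholders.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Different modalities are embedded into the same space...
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg", "car.jpg"], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# ...so cross-modal similarity is just a dot product between text and image vectors.
print(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T)
```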

Another example is Jina AI’s open-source embedding model, which is designed for creating sentence embeddings and supports context lengths up to 8192 tokens [13]. While Jina AI’s model is primarily focused on text, the company is known for its work in multimodal AI, and their tools and infrastructure can be used to build multimodal AI applications [14].

Additionally, the NExT-GPT model from the National University of Singapore is an end-to-end multimodal LLM that can handle input and generate output in combinations of text, image, video, and audio [15].

These open-source models and tools enable developers and researchers to experiment with and deploy multimodal AI systems without the need to develop such complex models from scratch.

I guess the ideal future is: use vector embeddings both in retrieval and generation, multimodally. Or am I missing something?

Your vision for the future of AI systems aligns with the direction in which the field is moving. The ideal scenario indeed involves using vector embeddings for both retrieval and generation across multiple modalities (text, images, audio, video, etc.). Here’s why this approach is powerful:

  1. Semantic Understanding: Vector embeddings allow systems to understand the semantic meaning of content, which is crucial for accurately retrieving and generating information that is contextually relevant.

  2. Cross-Modal Interactions: In a multimodal system, embeddings enable the AI to understand and relate information across different types of data. For example, it could retrieve relevant images based on a text description or generate descriptive text for a video clip.

  3. Improved Search: Using embeddings for retrieval can enhance search engines by allowing them to return results that are semantically related to the query, rather than relying solely on keyword matching.

  4. Personalization and Relevance: Embeddings can be used to personalize content and improve the relevance of recommendations in systems like digital assistants, e-commerce platforms, and content streaming services.

  5. Efficiency and Scalability: With advancements in approximate nearest neighbor (ANN) search algorithms and indexing structures, it’s becoming more feasible to use embeddings for efficient retrieval from large datasets.

  6. Unified AI Systems: A system that uses a universal embedding space for different modalities can be more easily integrated, managed, and scaled, as it simplifies the underlying architecture.

While there are challenges to overcome, such as the computational cost of generating and storing embeddings and the need for robust models that can handle diverse and noisy real-world data, the benefits of such an approach are driving research and development in this area. As AI technology continues to advance, we can expect to see more sophisticated systems that leverage vector embeddings for both retrieval and generation in a multimodal context.


References

[1] phdata.io: What is a Vector Database
[2] eckerson.com: The Why, What, Who, and Where of Vector Databases
[3] featureform.com: Embeddings in Machine Learning: Everything You Need to Know
[4] linkedin.com: What’s the difference between indexing and embedding?
[5] scholars.wlu.ca: Web Search Algorithms and PageRank
[6] cloud.google.com: Find anything blazingly fast with Google’s vector search technology
[7] towardsdatascience.com: Embeddings + Knowledge Graphs: The Ultimate Tools for RAG Systems
[8] allthings.how: What is Perplexity AI and How to Use It?
[9] reddit.com: Embeddings & PPLX API Endpoint
[10] metademolab.com: ImageBind Demo
[11] theverge.com: Meta Open-Sources Multisensory AI Model
[12] petapixel.com: Meta Unveils Open-Source Multimodal Generative AI System
[13] infoq.com: Jina AI’s Open-Source Model for Sentence Embeddings
[14] jina.ai: Jina AI - Tools for Building Multimodal AI Applications
[15] github.com: NExT-GPT - Multimodal End-to-End LLM

Based on a chat with perplexity.ai

Written on March 20, 2024