Elasticsearch Utility

The `elasticsearch_util.py` file contains utility functions for interacting with Elasticsearch in the Bayard app. It searches for documents relevant to user input through the Elasticsearch search API. The file is responsible for establishing a connection to the Elasticsearch cluster, constructing search queries, executing searches, and processing the search results.

Configuration

The configuration section of the code reads the connection details for the Elasticsearch cluster from two environment variables: `ES_URL` for the cluster URL and `ES_API_KEY` for the API key. Both must be set for the connection to succeed.

```python
import os

# Connection details come from the environment, not from the source code.
ES_URL = os.environ.get("ES_URL")
ES_API_KEY = os.environ.get("ES_API_KEY")
```

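Because `os.environ.get` returns `None` for an unset variable, a missing setting would only surface later, when the client tries to connect. The following is a minimal sketch of checking both variables up front. The helper name `load_es_config` and its optional `env` parameter are assumptions made for illustration and are not part of `elasticsearch_util.py`.

```python
import os


def load_es_config(env=None):
    """Return (url, api_key), or raise if either variable is missing.

    Hypothetical helper: `env` defaults to os.environ and exists only so
    the check can be exercised with a plain dictionary.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way.
    missing = [name for name in ("ES_URL", "ES_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ES_URL"], env["ES_API_KEY"]
```

Failing at startup with the names of the missing variables tends to be easier to diagnose than a connection error raised on the first search.
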
After retrieving the URL and API key, the code creates an Elasticsearch client instance named `es_client` using the `Elasticsearch` class from the `elasticsearch` library. The client is initialized with the URL and API key, which lets the application interact with the Elasticsearch cluster.

```python
from elasticsearch import Elasticsearch

# Create one client for the module; it is used by the search function below.
es_client = Elasticsearch(ES_URL, api_key=ES_API_KEY)
```

Functions
search_elasticsearch(user_input)

The `search_elasticsearch` function is the main entry point for performing searches in Elasticsearch. It takes a single parameter, `user_input`, which represents the user's search query. The function constructs a search query using the `text_expansion` query type with the ELSER model, executes the search, and processes the search results.

The search query is defined in the `search_body` dictionary, which specifies the `text_expansion` query type and the `content_embedding` field. The `model_id` parameter names the ELSER model to use for the search, and the `model_text` parameter is set to the `user_input` value. The `size` parameter sets the maximum number of documents returned in the search results.

```python
search_body = {
    "query": {
        "text_expansion": {
            "content_embedding": {
                "model_id": ".elser_model_2_linux-x86_64",
                "model_text": user_input
            }
        }
    },
    "size": 3
}
```

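To make the structure of the request easier to inspect, the same dictionary can be built by a small function. This is only a sketch: `build_search_body` is a hypothetical helper, and exposing `model_id` and `size` as parameters is an assumption, since the original code hard-codes both values.

```python
def build_search_body(user_input, model_id=".elser_model_2_linux-x86_64", size=3):
    """Build the text_expansion request body described above.

    Hypothetical helper; the defaults mirror the values used in search_body.
    """
    return {
        "query": {
            "text_expansion": {
                # The field that holds the ELSER token weights for each document.
                "content_embedding": {
                    "model_id": model_id,
                    "model_text": user_input,
                }
            }
        },
        # Upper bound on the number of hits returned.
        "size": size,
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_search_body("LGBTQ+ rights"), indent=2))
```

Building the body in one place keeps the query shape testable without a running cluster.
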
The search is executed using the `search` method of the `es_client` instance. The search is performed on the `"bayardcorpus"` index, and the `search_body` dictionary is passed as the request body.

```python
search_results = es_client.search(index="bayardcorpus", body=search_body)
```

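The call can fail, for example when the cluster is unreachable. As a rough sketch of running it defensively, the function below wraps the call and returns `None` on any exception. The name `run_search` is hypothetical, and passing the client in as a parameter is an assumption made so that the snippet can be tried with a stand-in object instead of a live cluster.

```python
def run_search(client, search_body, index="bayardcorpus"):
    """Execute the search and return the raw response, or None on failure.

    Hypothetical wrapper: `client` is any object with a compatible
    search(index=..., body=...) method, such as es_client.
    """
    try:
        return client.search(index=index, body=search_body)
    except Exception as exc:
        # Report the problem and signal failure to the caller with None.
        print(f"Error searching Elasticsearch: {exc}")
        return None
```

With the module-level client, this would be called as `run_search(es_client, search_body)`.
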
After executing the search, the function retrieves the search hits from the `hits` field of the search results. It then processes each hit, extracting relevant fields such as the document title, abstract, authors, classification, concepts, year published, download URL, emotion, sentiment, categories, and unique identifier. The extracted information is stored in a dictionary named `filtered_doc`.

To avoid duplicate results based on the document title, the function maintains a set called `seen_titles`. If a document title has already been encountered, the document is skipped so that each title appears only once in the results.

The processed documents are appended to the `filtered_docs` list, which is returned by the function. If no documents are found, an empty list is returned.

If an error occurs during the search process, the function catches the exception, prints an error message, and returns `None` to indicate a failure.

Usage

To use the Elasticsearch functionality provided by the `elasticsearch_util.py` file, call the `search_elasticsearch` function with the user's search query as the argument. For example:

```python
user_input = "LGBTQ+ rights"
search_results = search_elasticsearch(user_input)
```

The function will execute the search in Elasticsearch using the provided user input and return a list of filtered documents that match the search query. Each document in the list is represented as a dictionary containing various fields such as the document title, abstract, authors, and more.
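
The way hits become those dictionaries, including the title-based de-duplication described for `search_elasticsearch`, can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: `filter_hits` is a hypothetical name, the `_source` field names (`title`, `abstract`, `authors`) are guesses because the index mapping is not shown here, and the input is a plain dictionary shaped like a search response.

```python
def filter_hits(search_results, fields=("title", "abstract", "authors")):
    """Turn a search response into a list of de-duplicated document dicts.

    Hypothetical sketch; the field names in `fields` are assumptions.
    """
    seen_titles = set()
    filtered_docs = []
    # Standard response shape: {"hits": {"hits": [{"_source": {...}}, ...]}}
    for hit in search_results.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        title = source.get("title")
        if title in seen_titles:
            continue  # Skip documents whose title was already returned.
        seen_titles.add(title)
        # Missing fields become None instead of raising KeyError.
        filtered_doc = {name: source.get(name) for name in fields}
        filtered_docs.append(filtered_doc)
    return filtered_docs
```

A response with no hits yields an empty list, which matches the behaviour described for the function when no documents are found.
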

By combining Elasticsearch with the ELSER model, the `elasticsearch_util.py` file gives the Bayard app a single interface for retrieving relevant documents from user input. The returned dictionaries can then be processed and used elsewhere in the application.