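Azure Cognitive Search extends traditional search engines by combining artificial intelligence (AI) with established search techniques to retrieve more relevant and insightful results. It goes beyond keyword matching, using AI to interpret the user's intent and the contextual meaning of terms so that results better match what the user needs. Using natural language processing and machine learning, it identifies patterns and relationships in content to give nuanced responses, which makes it a valuable tool for information retrieval.
In a Colab notebook, run: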
!pip install azure-search-documents pdfplumber
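Create a new instance of the Azure Cognitive Search service through the Azure portal, and note down the service name and admin key.
Replace service_name, admin_key, and index_name with your service name, admin key, and desired index name, respectively.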
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents import SearchClient
service_name = "your_service_name"
admin_key = "your_admin_key"
index_name = "your_index_name"
endpoint = f"https://{service_name}.search.windows.net/"
credential = AzureKeyCredential(admin_key)
# SearchIndexClient manages indexes at the service level, so it takes no index name
admin_client = SearchIndexClient(endpoint=endpoint, credential=credential)
# SearchClient works against one specific index (upload and query)
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)
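Pasting the admin key straight into a notebook makes it easy to leak. As a minimal sketch, assuming the credentials are kept in environment variables named AZURE_SEARCH_SERVICE and AZURE_SEARCH_ADMIN_KEY (both names are hypothetical choices for this example, not something the SDK requires), the endpoint and key could be loaded like this:

```python
import os


def load_search_settings(env=os.environ):
    """Read the service name and admin key from environment variables.

    The variable names used here are assumptions made for this sketch.
    Returns a (endpoint, admin_key) tuple.
    """
    service_name = env.get("AZURE_SEARCH_SERVICE")
    admin_key = env.get("AZURE_SEARCH_ADMIN_KEY")
    if not service_name or not admin_key:
        raise RuntimeError(
            "Set AZURE_SEARCH_SERVICE and AZURE_SEARCH_ADMIN_KEY first."
        )
    # Same endpoint pattern as in the setup code above
    endpoint = f"https://{service_name}.search.windows.net/"
    return endpoint, admin_key
```

The returned values can then be passed to AzureKeyCredential and the two clients in place of the hard-coded strings.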
Define the index schema and create the index.
from azure.search.documents.indexes.models import SearchIndex, SimpleField, SearchFieldDataType, SearchableField
fields = [
    # Key field that uniquely identifies each document
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="title", type=SearchFieldDataType.String, sortable=True),
    # Full-text field analyzed with the English Lucene analyzer
    SearchableField(name="content", type=SearchFieldDataType.String, analyzer_name="en.lucene"),
]
index = SearchIndex(name=index_name, fields=fields)
admin_client.create_index(index)
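Re-running the cell above fails once the index exists, which happens often in a notebook. The following is a minimal sketch of a guard, assuming admin_client is a SearchIndexClient and index is a SearchIndex; the helper name ensure_index is made up for this example:

```python
def ensure_index(admin_client, index):
    """Create the index only if no index with that name exists yet.

    Returns True if the index was created, False if it was already there.
    """
    # list_index_names() yields the names of all indexes on the service
    existing = set(admin_client.list_index_names())
    if index.name in existing:
        return False
    admin_client.create_index(index)
    return True
```

Calling ensure_index(admin_client, index) instead of admin_client.create_index(index) lets the notebook be re-run without touching an existing index. If the schema itself should be changed, the SDK's create_or_update_index method may be the better fit.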
Download the PDF and extract its content with pdfplumber.
import pdfplumber
import requests
url = "https://raw.githubusercontent.com/fenago/datasets/main/books/Frederick_Douglass.pdf"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
filename = "Frederick_Douglass.pdf"
with open(filename, 'wb') as file:
    file.write(response.content)
with pdfplumber.open(filename) as pdf:
    # extract_text() may return None for pages without a text layer
    text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
print(text[:500])  # print the first 500 characters of the book
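The extraction step leaves the whole book in a single string. One possible refinement, not part of the original walkthrough, is to split that string into overlapping chunks so that each chunk can later be indexed as its own document. The sketch below is plain Python; the function name and the default sizes are arbitrary assumptions:

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in one of the two chunks.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars]
        if chunk.strip():  # skip chunks that are only whitespace
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break  # the current chunk already reached the end of the text
    return chunks
```

For instance, chunk_text(text) on the extracted book returns a list of strings that could each become the content of a separate document with its own id.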
Upload the extracted content to the index you created.
batch = [{"@search.action": "upload", "id": "1", "title": "Frederick Douglass", "content": text}]
results = search_client.upload_documents(batch)
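The upload call returns one indexing result per document, but the code above never inspects it. As a hedged sketch, the helper below (its name is made up for this example) uploads a list of documents in slices and counts how many succeeded. It assumes search_client is a SearchClient and relies on the succeeded attribute of each returned result; the default slice size of 1000 reflects the service's documented per-batch limit:

```python
def upload_in_batches(search_client, documents, batch_size=1000):
    """Upload documents in slices of batch_size.

    Returns the number of documents the service reports as indexed.
    """
    succeeded = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        results = search_client.upload_documents(documents=batch)
        # Each result carries a `succeeded` flag for its document
        succeeded += sum(1 for result in results if result.succeeded)
    return succeeded
```

With the single-document batch from this tutorial, upload_in_batches(search_client, batch) should return 1 when the upload worked.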
Query the index and retrieve the relevant results.
search_text = "freedom"
results = search_client.search(search_text=search_text, include_total_count=True)
for result in results:
    print(result)
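The query above passes include_total_count=True but never reads the count. A small sketch of how the count and the relevance scores might be read, assuming search_client is a SearchClient and using the get_count() method of the returned results object; the helper name summarize_hits is hypothetical:

```python
def summarize_hits(search_client, query, top=5):
    """Return the total match count and (score, title) pairs for the top hits."""
    results = search_client.search(
        search_text=query,
        include_total_count=True,
        select=["id", "title"],  # leave out the large content field
        top=top,
    )
    total = results.get_count()
    # Every hit exposes its relevance score under the "@search.score" key
    ranked = [(hit["@search.score"], hit["title"]) for hit in results]
    return total, ranked
```

For example, total, ranked = summarize_hits(search_client, "freedom") gives the number of matching documents and their titles with scores.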
search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)
for result in results:
    print(f"ID: {result['id']}")
    print(f"Title: {result['title']}")
    print(f"Content: {result['content']}\n{'='*40}\n")
import json
search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)
for result in results:
    print(json.dumps(result, indent=4))
    print('='*40)
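The search_client.search() method does not return the book as such. It returns a collection of search results, where each result corresponds to a document in the index that matches the query "who is Frederick Douglass?".
Each search result typically includes the document's retrievable fields (here id, title, and content) along with metadata such as the relevance score in @search.score.
Note that the content field of a result holds the full stored value of that field, not just the passage where the match occurred. Because this tutorial indexes the entire book as a single document, content can be very long.
To limit the amount of text that comes back, use the select parameter ($select in the REST API) to specify which fields to include in the search results, and use hit highlighting to get the specific fragments of content where the match occurred.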
For example:
search_text = "who is Frederick Douglass?"
# select restricts the returned fields to id and title
results = search_client.search(search_text=search_text, select=["id", "title"], include_total_count=True)
# Initialize a flag to check if results are found
found = False
# Iterate over the results and print each one
for result in results:
    print(f"ID: {result['id']}")
    print(f"Title: {result['title']}\n{'='*40}\n")
    found = True
# Check if no results were found
if not found:
    print("No results found for the search query.")
This returns only the id and title fields in the results and leaves out the content field.
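Hit highlighting can be sketched in the same way. The helper below is an illustration rather than the tutorial's own code: it assumes search_client is a SearchClient, passes highlight_fields="content" to ask the service for matching fragments, and reads them from the "@search.highlights" entry of each result. The name search_with_highlights is made up for this example:

```python
def search_with_highlights(search_client, query, top=3):
    """Return id, title, and highlighted content fragments for each hit."""
    results = search_client.search(
        search_text=query,
        select=["id", "title"],      # skip the full content field
        highlight_fields="content",  # request matching fragments instead
        top=top,
    )
    hits = []
    for result in results:
        # "@search.highlights" maps field names to lists of fragments;
        # it can be missing or None when nothing was highlighted
        highlights = result.get("@search.highlights") or {}
        hits.append({
            "id": result["id"],
            "title": result["title"],
            "snippets": highlights.get("content", []),
        })
    return hits
```

By default the service wraps matched terms in em tags, so each fragment shows where in the text the query matched without returning the whole book.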
search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)
found = False
for result in results:
    print(f"\n{'='*40}")
    print(f"ID: {result['id']}")
    print(f"Title: {result['title']}")
    # You can truncate or format the content to make it more readable
    content = result['content']
    if len(content) > 200:  # Limiting to 200 characters, you can adjust as needed
        content = content[:200] + "..."
    print(f"Content: {content}")
    print(f"{'='*40}\n")
    found = True
if not found:
    print("No results found for the search query.")
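In this revised tutorial, we walked through creating an index, uploading a PDF document to Azure Cognitive Search, and running search queries against it. By following these steps, you can build a reliable search solution that fits your specific needs.
Page updated: 2024-02-12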