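Building Semantic Search with Azure Cognitive Search and Python on Google Colab

Introduction

Azure Cognitive Search extends traditional search engines by combining artificial intelligence (AI) with established search techniques, so that it can return more relevant and more insightful results. It goes beyond keyword-based search, using AI to interpret the user's intent and the contextual meaning of terms, which produces results that better match what the user needs. With natural language processing and machine learning, it identifies patterns and relationships in content to give nuanced responses, which makes it a valuable tool for information retrieval.

Step 1: Install the required libraries

In a Colab notebook, run: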

!pip install azure-search-documents pdfplumber

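Step 2: Set up the Azure Cognitive Search service

Create a new Azure Cognitive Search service instance in the Azure portal, and note the service name and the admin key.

Step 3: Initialize the Azure search clients

Replace service_name, admin_key, and index_name with your service name, admin key, and desired index name, respectively.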

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents import SearchClient

service_name = "your_service_name"
admin_key = "your_admin_key"
index_name = "your_index_name"

endpoint = f"https://{service_name}.search.windows.net/"
# SearchIndexClient manages indexes on the service, so it takes no index_name
admin_client = SearchIndexClient(endpoint=endpoint, credential=AzureKeyCredential(admin_key))
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=AzureKeyCredential(admin_key))
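
Hardcoding the admin key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the three values can be read from environment variables instead. The variable names AZURE_SEARCH_SERVICE, AZURE_SEARCH_ADMIN_KEY, and AZURE_SEARCH_INDEX, the helper load_search_settings, and the default index name "books" are assumptions made for illustration, not names defined by Azure.

```python
import os


def load_search_settings(env=os.environ):
    """Build the client settings from (hypothetical) environment variables."""
    service_name = env.get("AZURE_SEARCH_SERVICE", "")
    admin_key = env.get("AZURE_SEARCH_ADMIN_KEY", "")
    index_name = env.get("AZURE_SEARCH_INDEX", "books")  # assumed default

    # Fail early with a clear message instead of a confusing HTTP error later.
    required = {
        "AZURE_SEARCH_SERVICE": service_name,
        "AZURE_SEARCH_ADMIN_KEY": admin_key,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))

    return {
        "endpoint": f"https://{service_name}.search.windows.net/",
        "admin_key": admin_key,
        "index_name": index_name,
    }
```

The returned endpoint, admin_key, and index_name can then be passed to the client constructors in place of the literal strings.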

Step 4: Create the index

Define the index schema and create the index.

from azure.search.documents.indexes.models import SearchIndex, SimpleField, SearchFieldDataType, SearchableField

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="title", type=SearchFieldDataType.String, sortable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, analyzer_name="en.lucene"),
]
index = SearchIndex(name=index_name, fields=fields)
admin_client.create_index(index)
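
Re-running the cell above typically fails once the index exists, which happens often in a notebook. The sketch below creates the index only when its name is not already present on the service. The helper name ensure_index is hypothetical; it relies on the list_index_names() and create_index() methods of SearchIndexClient and works with any object that provides them.

```python
def ensure_index(index_client, index):
    """Create `index` only if the service has no index with that name.

    Returns True when the index was created, False when it already existed.
    """
    existing_names = set(index_client.list_index_names())
    if index.name in existing_names:
        return False
    index_client.create_index(index)
    return True
```

With the objects from this step, the call would be ensure_index(admin_client, index). If the schema may change between runs, create_or_update_index() on the same client is an alternative worth considering.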

Step 5: Download and read the PDF content

Download the PDF and extract its content with pdfplumber.

import pdfplumber
import requests

url = "https://raw.githubusercontent.com/fenago/datasets/main/books/Frederick_Douglass.pdf"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
filename = "Frederick_Douglass.pdf"

with open(filename, 'wb') as file:
    file.write(response.content)

with pdfplumber.open(filename) as pdf:
    # extract_text() can return None for pages without a text layer
    text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
print(text[:500])  # print the first 500 characters of the book

Step 6: Upload the data to the index

Upload the extracted content to the index you created.

batch = [{"@search.action": "upload", "id": "1", "title": "Frederick Douglass", "content": text}]
results = search_client.upload_documents(batch)
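
The upload above stores the whole book as one document, so every matching query returns that single document. A common alternative is to split the text into overlapping chunks and upload one document per chunk, so that results point at passages. The sketch below is one simple way to do that; the helper names chunk_text and build_documents, the character-based splitting, and the default sizes are assumptions chosen for illustration, not part of the Azure SDK.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into chunks of at most max_chars, overlapping by `overlap`."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    if not text:
        return []
    chunks = []
    start = 0
    step = max_chars - overlap
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk reached the end of the text
        start += step
    return chunks


def build_documents(title, text, max_chars=2000, overlap=200):
    """Turn the chunks into documents matching the id/title/content schema."""
    return [
        {"id": str(number), "title": title, "content": chunk}
        for number, chunk in enumerate(chunk_text(text, max_chars, overlap), start=1)
    ]
```

The resulting list can be passed to search_client.upload_documents(...) in place of the single-item batch, for example build_documents("Frederick Douglass", text).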

Step 7: Run a semantic search

Query the index and get semantically relevant results.

search_text = "freedom"
results = search_client.search(search_text=search_text, include_total_count=True)
for result in results:
    print(result)
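
The query above runs with the default query settings, which rank by keyword relevance. Azure's semantic ranker is opt-in. The sketch below shows how such a query might look, under several assumptions: the semantic ranker is enabled on the service, the index defines a semantic configuration (the name "default" here is hypothetical), and the installed azure-search-documents version accepts the query_type and semantic_configuration_name keyword arguments (as recent 11.4+ releases do). The helper name semantic_query is also hypothetical.

```python
def semantic_query(search_client, text, config_name="default", top=3):
    """Run a semantic-ranked query and return (reranker score, id, title) rows."""
    results = search_client.search(
        search_text=text,
        query_type="semantic",                    # opt in to semantic ranking
        semantic_configuration_name=config_name,  # must exist in the index
        top=top,
    )
    # "@search.reranker_score" is only present when semantic ranking ran.
    return [
        (doc.get("@search.reranker_score"), doc["id"], doc["title"])
        for doc in results
    ]
```

Calling semantic_query(search_client, "freedom") against an index without a matching semantic configuration is expected to fail with a service error, so the index definition has to be extended first.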

search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)

for result in results:
    print(f"ID: {result['id']}")
    print(f"Title: {result['title']}")
    print(f"Content: {result['content']}\n{'='*40}\n")

The same query can also be printed as JSON, which shows every returned field together with the search metadata:

import json

search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)

for result in results:
    print(json.dumps(result, indent=4))
    print('='*40)

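The search_client.search() method returns a collection of search results rather than one block of text. Each result corresponds to a document in the index that matches the query "who is Frederick Douglass?".

Each search result typically includes the document's retrievable fields plus search metadata such as @search.score.

By default, the content field of a result holds the full stored value of that field, not only the fragment where the match occurred. Because this tutorial uploads the whole book as a single document, the content field here contains the entire book text.

To limit what comes back, use the select parameter ($select in the REST API) to choose which fields are included in the results, and use hit highlighting to get only the parts of content where matches occurred.

For example: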

search_text = "who is Frederick Douglass?"
results = search_client.search(
    search_text=search_text,
    select=["id", "title"],  # return only these fields
    include_total_count=True,
)

# Initialize a flag to check if results are found
found = False

# Iterate over the results and print each one
for result in results:
    print(f"ID: {result['id']}")
    print(f"Title: {result['title']}")
    print('='*40)
    found = True

# Check if no results were found
if not found:
    print("No results found for the search query.")

This returns only the id and title fields in the results, without the content field.

To keep content in the results but make the output easier to read, truncate it when printing:

search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)

found = False

for result in results:
    print(f"\n{'='*40}")
    print(f"ID: {result['id']}")
    print(f"Title: {result['title']}")

    # You can truncate or format the content to make it more readable
    content = result['content']
    if len(content) > 200:  # Limiting to 200 characters, you can adjust as needed
        content = content[:200] + "..."
    print(f"Content: {content}")
    print(f"{'='*40}\n")

    found = True

if not found:
    print("No results found for the search query.")
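
Truncating on the client always shows the start of the field, not the place where the query matched. Hit highlighting returns the matching fragments instead. The sketch below formats those fragments from a result set; it assumes the query was issued with highlight_fields="content", in which case each result carries an "@search.highlights" entry that maps field names to lists of fragments. The helper name format_highlights is hypothetical.

```python
def format_highlights(results, field="content"):
    """Collect 'id: fragment' lines from the highlights of each result."""
    lines = []
    for doc in results:
        # The key is absent (or None) when no highlighting was requested.
        highlights = doc.get("@search.highlights") or {}
        for fragment in highlights.get(field, []):
            lines.append(f"{doc['id']}: {fragment}")
    return lines


# Intended use with the client from this tutorial (requires a live service):
# results = search_client.search(search_text="freedom", highlight_fields="content")
# for line in format_highlights(results):
#     print(line)
```

Matched terms inside each fragment are wrapped in <em> tags by default, which can be stripped or replaced before display.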

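Conclusion

In this revised tutorial, we walked through creating an index, uploading a PDF document to Azure Cognitive Search, and running semantic search queries. By following these steps, you can build a reliable semantic search solution suited to your specific needs.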

Page updated: 2024-02-12
