Amazon CloudSearch

2022-04-27

Overview
Amazon CloudSearch is a fully managed service in the cloud that makes it easy to set up, manage, and scale a search solution for your website or application.
With Amazon CloudSearch you can search large collections of data such as web pages, document files, forum posts, or product information.
As your volume of data and traffic fluctuates, Amazon CloudSearch scales to meet your needs.
You can use Amazon CloudSearch to index and search both structured data and plain text. Amazon CloudSearch features:
Full text search with language-specific text processing
Boolean search
Prefix searches
Range searches
Term boosting
Faceting
Highlighting
Autocomplete suggestions
You can get search results in JSON or XML, sort and filter results based on field values, and sort results alphabetically, numerically, or according to custom expressions.
To build a search solution with Amazon CloudSearch, you take the following steps:
Create and configure a search domain. If you have multiple collections of data that you want to make searchable, you can create multiple search domains, one per collection.
Upload the data you want to search to your domain.
Search your domain.
How Search Works
The collection of data that you want to search (sometimes referred to as your corpus) can consist of unstructured full-text documents, semi-structured documents such as those formatted in mark-up languages like XML, or structured data that conforms to a strict data model.
Each item that you want to be able to search (such as a forum post or web page) is represented as a document.
Every document has a unique ID and one or more fields that contain the data that you want to search and include in results.
To make your data searchable:
Represent it as a batch of documents in either JSON or XML, and upload the batch to your search domain.
Amazon CloudSearch then generates a search index from your document data according to your domain's configuration options.
You submit queries against this index to find the documents that meet specific search criteria.
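As a rough illustration of the document batch described above, the following Python sketch builds a JSON batch from a small collection of forum posts. It assumes the 2013-01-01 JSON batch format, in which each entry is an operation with a type, an id, and a fields object. The helper name build_batch and the field names title and body are hypothetical.

```python
import json


def build_batch(documents):
    """Turn {doc_id: fields} into a list of "add" operations, one per document."""
    return [
        {"type": "add", "id": doc_id, "fields": fields}
        for doc_id, fields in documents.items()
    ]


# Hypothetical forum posts; every document has a unique ID and one or more fields.
posts = {
    "post-1": {"title": "Getting started", "body": "How do I create a search domain?"},
    "post-2": {"title": "Facets", "body": "Which fields can I use as categories?"},
}

batch = build_batch(posts)
payload = json.dumps(batch, indent=2)  # the JSON text you would upload
print(payload)
```

The resulting payload is what you would upload to the domain. The upload itself is left out because it depends on your endpoint and credentials.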
Indexing in Amazon CloudSearch
To build a search index from your data, Amazon CloudSearch needs the following information:
Which document fields do you want to search?
Which document field values do you want to retrieve with the search results?
Which document fields represent categories that you want to use to refine and filter search results?
How should the text within a particular field be processed?
You define this metadata in your domain configuration by configuring indexing options.
You must configure a corresponding index field for each document field that occurs in your data—there's a one-to-one mapping between document fields and the fields in your Amazon CloudSearch index. In addition to the index field name, you specify the following:
The index field type
Whether the field is searchable (text and text-array fields are always searchable)
Whether the field can be used as a category (facet)
Whether the field value can be returned with the search results
Whether the field can be used to sort the results
Whether highlights can be returned for the field
A default value to use if no value is specified in the document data.
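The one-to-one mapping between document fields and index fields can be checked before you upload. The sketch below models the options listed above as a plain Python dataclass and reports any document field that has no index field configured. The IndexField class and the unmapped_fields helper are hypothetical illustrations, not part of the CloudSearch API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexField:
    """Hypothetical local model of one index field and its options."""
    name: str
    field_type: str            # e.g. "text", "literal", "int", "date"
    search: bool = True        # text and text-array fields are always searchable
    facet: bool = False        # usable as a category
    return_value: bool = True  # value can be returned with results
    sort: bool = False         # usable for sorting
    highlight: bool = False    # highlights can be returned
    default: object = None     # used when the document has no value


def unmapped_fields(documents, index_fields):
    """Return document field names that have no configured index field."""
    configured = {field.name for field in index_fields}
    seen = set()
    for doc in documents:
        seen.update(doc["fields"])
    return sorted(seen - configured)


index_fields = [
    IndexField("title", "text", sort=True, highlight=True),
    IndexField("category", "literal", facet=True),
    IndexField("year", "int", facet=True, sort=True, default=0),
]

documents = [
    {"id": "1", "fields": {"title": "First", "category": "news", "year": 2013}},
    {"id": "2", "fields": {"title": "Second", "rating": 4}},
]

missing = unmapped_fields(documents, index_fields)
print(missing)
```

Here the rating field appears in the data but has no index field, so it is reported as unmapped.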
Facets in Amazon CloudSearch
A facet is an index field that represents a category that you want to use to refine and filter search results.
A facet can be any date, literal, or numeric field that has faceting enabled in your domain configuration.
For each facet, Amazon CloudSearch calculates the number of hits that share the same value.
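The facet calculation amounts to counting hits per distinct field value. A minimal local sketch of that idea, with a hypothetical genre field and facet_counts helper (this is not how the service is implemented, only what the counts mean):

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many hits share each value of the given field."""
    counts = Counter()
    for hit in hits:
        value = hit.get(field)
        if value is None:
            continue  # hits without the field contribute to no bucket
        # Array fields contribute one count per value they contain.
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts


hits = [
    {"title": "A", "genre": "Action"},
    {"title": "B", "genre": "Drama"},
    {"title": "C", "genre": "Action"},
    {"title": "D"},
]

counts = facet_counts(hits, "genre")
print(counts.most_common())
```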
Text Processing in Amazon CloudSearch
During indexing, Amazon CloudSearch processes the contents of text and text-array fields according to the language-specific analysis scheme configured for the field.
An analysis scheme controls how the text is normalized, tokenized, and stemmed, and specifies any stopwords or synonyms to take into account during indexing.
Amazon CloudSearch provides default analysis schemes for each supported language.
Sorting Results in Amazon CloudSearch
You can customize how search results are ranked by defining expressions that calculate custom values for every document that matches your search criteria.
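To make the idea concrete, the sketch below computes a custom value for every matching document and sorts by it. It runs locally on hypothetical hits. The popularity field and the 0.7/0.3 weights are assumptions chosen for illustration, and the expression is written as a Python function rather than in the CloudSearch expression syntax.

```python
def rank(hits, expression):
    """Sort hits by the value the expression computes for each one, highest first."""
    return sorted(hits, key=expression, reverse=True)


def blended(hit):
    """Hypothetical expression: mostly relevance, partly popularity."""
    return 0.7 * hit["_score"] + 0.3 * hit["popularity"]


hits = [
    {"id": "a", "_score": 2.0, "popularity": 1.0},
    {"id": "b", "_score": 1.0, "popularity": 9.0},
    {"id": "c", "_score": 1.5, "popularity": 4.0},
]

by_relevance = [h["id"] for h in rank(hits, lambda h: h["_score"])]
by_expression = [h["id"] for h in rank(hits, blended)]
print(by_relevance, by_expression)
```

With these sample values the relevance-only order is a, c, b, while the blended expression promotes the popular document and yields b, c, a.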
Search Requests in Amazon CloudSearch
You submit search requests to your domain's search endpoint as HTTP/HTTPS GET requests.
You can specify a variety of options to constrain your search, request facet information, control ranking, and specify what you want to be returned in the results.
You can get search results in either JSON or XML. By default, Amazon CloudSearch returns results in JSON.
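A search request is an ordinary URL with query parameters. The following sketch assembles one with the Python standard library. It assumes the 2013-01-01 search API path and the q, return, sort, size, format, and facet.FIELD parameters. The endpoint hostname and the field names are placeholders, and build_search_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(endpoint, query, *, return_fields=None, sort=None,
                     facet_fields=(), size=10, response_format="json"):
    """Assemble a GET URL for a domain's search endpoint (assumed 2013-01-01 API)."""
    params = {"q": query, "size": size, "format": response_format}
    if return_fields:
        params["return"] = ",".join(return_fields)
    if sort:
        params["sort"] = sort
    for field in facet_fields:
        # An empty JSON object requests facet counts with default options.
        params[f"facet.{field}"] = "{}"
    return f"https://{endpoint}/2013-01-01/search?{urlencode(params)}"


url = build_search_url(
    "search-movies-example.us-east-1.cloudsearch.amazonaws.com",  # placeholder
    "star wars",
    return_fields=["title", "year"],
    sort="year desc",
    facet_fields=["genre"],
)
print(url)

# Decode the query string again to confirm what was encoded.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"], decoded["sort"])
```

Sending the request is omitted because it needs a real domain. Any HTTP client can issue the GET once the URL is built.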
When you submit a search request, Amazon CloudSearch performs text processing on the search string. The search string is processed to:
Convert all characters to lowercase
Split the string into separate terms on whitespace and punctuation boundaries
Remove terms that are on the stopword list for the field being searched.
Map stems and synonyms according to the stemming and synonym options configured for the field being searched.
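The four steps above can be imitated with a toy pipeline. This is only a sketch of the sequence, not the analyzer CloudSearch actually uses: the stopword list, the stem table, and the synonym table are tiny hypothetical stand-ins for a real analysis scheme.

```python
import re

# Hypothetical, deliberately tiny analysis options.
STOPWORDS = {"a", "an", "the", "of", "in", "to"}
STEMS = {"running": "run", "films": "film", "movies": "movie"}
SYNONYMS = {"film": "movie"}


def process(search_string):
    """Lowercase, tokenize, drop stopwords, then map stems and synonyms."""
    lowered = search_string.lower()
    # Split on whitespace and punctuation by keeping runs of letters and digits.
    terms = re.findall(r"[a-z0-9]+", lowered)
    terms = [t for t in terms if t not in STOPWORDS]
    terms = [STEMS.get(t, t) for t in terms]
    return [SYNONYMS.get(t, t) for t in terms]


terms = process("Running the BEST sci-fi films!")
print(terms)
```

Because the same processing is applied at indexing time and at query time, a search for "films" can match a document that contains "movie".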
By default, Amazon CloudSearch returns search results ranked according to the hits' relevance scores (_score).
Alternatively, your request can specify the index field or expression that you want to use to sort the hits.
Automatic Scaling
When you create a search domain, a single instance is deployed for the domain.
Amazon CloudSearch automatically scales the domain by adding instances as the volume of data or traffic increases.
When the amount of data you add to your domain exceeds the capacity of the initial search instance type, Amazon CloudSearch scales your search domain to a larger search instance type.
After a domain exceeds the capacity of the largest search instance type, Amazon CloudSearch partitions the search index across multiple search instances.
The number of search instances required to hold the index partitions is sometimes referred to as the domain's width.
As your search request volume or complexity increases, it takes more processing power to handle the load.
When a search instance nears its maximum load, Amazon CloudSearch deploys a duplicate search instance to provide additional processing power.
The number of duplicate search instances is sometimes referred to as the domain's depth.
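The relationship between width and depth can be pictured with a toy model. Everything numeric here is an assumption made for illustration: the capacity and throughput figures are invented, depth is taken to mean the number of copies of each partition, and the total is assumed to be width times depth. The service makes its own scaling decisions, so treat this only as a way to reason about the two dimensions.

```python
import math


def estimate_layout(index_gb, largest_instance_gb, peak_qps, qps_per_instance):
    """Toy estimate of (width, depth, total instances) under the stated assumptions."""
    # Width: partitions needed once the index outgrows the largest instance type.
    width = max(1, math.ceil(index_gb / largest_instance_gb))
    # Depth: copies of each partition needed to absorb the query load.
    depth = max(1, math.ceil(peak_qps / qps_per_instance))
    return width, depth, width * depth


layout = estimate_layout(
    index_gb=250, largest_instance_gb=100,  # hypothetical sizes
    peak_qps=900, qps_per_instance=400,     # hypothetical throughput
)
print(layout)
```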
