Getting Started with global support for editorial search
Global Support for Editorial Search, a component of Arc XP's commitment to global language support, enables you to search for content within Composer using languages like Korean or Arabic. With this feature, Arc XP users, such as those in newsrooms or content generation teams, can efficiently search for content in non-Latin-based languages directly within Composer.
Note
Global support for editorial search is available only at the time of onboarding to Arc XP. Onboarding customers must decide which languages to use before configuring their site in Arc XP.
Customers who have already onboarded are not eligible to change language analyzers in editorial search.
For a reminder on using the search function within Composer, see How to Use Composer Search. That article covers how to navigate the search bar and use the available search functions.
Language-based search in Composer: Key concepts
Let's break down how search works in Composer to make it easier for you to find what you need.
Composer search basics
When you're on the Composer home screen, the top search bar is where you start your search journey. You have two options: searching by term or by ID. Searching by ID finds an exact match for the story ID you enter, while searching by term looks for your search keywords in the headline and body of the article, among other places. The information in this article is specifically connected to the "Search by Term" option in Composer.
The default Composer search experience in Arc XP doesn't automatically account for the language of the content or your search terms. Essentially, it treats all content equally, regardless of language. If you want language-sensitive search results, and if your requirements align with what's discussed in How global support for editorial search helps customers, be sure to keep reading this documentation.
Understanding textual search concepts
Exact versus full-text search
Exact search finds an exact match of the search query within the story. Full-text search analyzes your content to find relevant matches based on the meaning of the terms you search for. Let's examine the difference using an example.
Exact match
Search Query: "breaking news"
Story Text: "Today's breaking news is about the stock market."
Exact Match: Story contains the exact phrase "breaking news" as entered in the search query.
Full-text match
Search Query: "breaking news"
Story Text: "Breaking: Major news event reported. Stay tuned for updates."
Full-Text Match: Even though the phrase "breaking news" is not present in the story text, the terms "breaking" and "news" are present separately, making it a relevant match based on the meaning of the search query.
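The difference between the two match types can be sketched in a few lines of Python. This is an illustrative sketch only, not how Composer implements search: the function names exact_match and full_text_match are hypothetical, and the full-text check is simplified to "every query term appears somewhere in the story."

```python
import re

def exact_match(query: str, text: str) -> bool:
    """True when the query appears verbatim (ignoring case) in the text."""
    return query.lower() in text.lower()

def full_text_match(query: str, text: str) -> bool:
    """True when every query term appears somewhere in the text."""
    text_terms = set(re.findall(r"\w+", text.lower()))
    query_terms = re.findall(r"\w+", query.lower())
    return all(term in text_terms for term in query_terms)

story_a = "Today's breaking news is about the stock market."
story_b = "Breaking: Major news event reported. Stay tuned for updates."

print(exact_match("breaking news", story_a))      # True
print(exact_match("breaking news", story_b))      # False
print(full_text_match("breaking news", story_b))  # True
```

The second story fails the exact check because the two words are not adjacent, but passes the full-text check because both terms are present.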
Indexing
Indexing is how Composer’s search engine organizes all the text from stories so it can quickly find what you're looking for. Composer separates the text into smaller parts (or tokens) and stores them in an index structure for easy retrieval. To delve deeper into the specific components of a story that Composer indexes, see Content API: Query Reference.
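As a rough illustration of what an index structure looks like, the following Python sketch builds a small inverted index: a mapping from each token to the stories that contain it. The story IDs, the build_index and search helpers, and the simple whitespace-and-punctuation splitting are all assumptions made for this example and do not describe Composer's actual engine.

```python
from collections import defaultdict

def build_index(stories: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase token to the set of story IDs that contain it."""
    index = defaultdict(set)
    for story_id, text in stories.items():
        for raw in text.lower().split():
            token = raw.strip(".,!?:;\"'")  # drop surrounding punctuation
            if token:
                index[token].add(story_id)
    return dict(index)

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of stories that contain every term in the query."""
    matches = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*matches) if matches else set()

# Hypothetical story IDs and text, for illustration only.
stories = {
    "story-1": "The quick brown fox jumps over the lazy dog.",
    "story-2": "Stock market news: quick recovery expected.",
}
index = build_index(stories)

print(sorted(search(index, "quick")))      # ['story-1', 'story-2']
print(sorted(search(index, "quick fox")))  # ['story-1']
print(sorted(search(index, "volcano")))    # []
```

Because each lookup goes straight to a token's entry, the engine does not need to scan every story for every search.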
Text analysis
Before indexing, the search engine performs some text analysis to better understand the content. This involves techniques like tokenization, stemming, and removing stop words.
Tokenization separates text into individual words or phrases. For instance:
Original Text: "The quick brown fox jumps over the lazy dog."
Tokenized Words: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Stemming reduces words to their root form. For example, "jumping" becomes "jump."
Stop words are common words like "the" or "and" that don't add much meaning and are often removed during analysis.
Here's a straightforward example demonstrating how stemming improves search result relevance:
Search Keyword: "run"
Stemming expands the search to include variations like "running," "runs," and "runner." Whether irregular forms like "ran" are also matched depends on the analyzer.
Ensures all relevant word forms from stories are included in search results.
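To make the idea concrete, here is a deliberately naive stemmer in Python. It is a sketch under strong assumptions: naive_stem is a hypothetical helper that only strips a few English suffixes, whereas production analyzers use far more careful rules. Irregular forms need dictionary-based handling that this sketch does not include.

```python
def naive_stem(word: str) -> str:
    """Strip one common English suffix; a toy rule set, not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "er", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind, e.g. "runn" -> "run".
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

for form in ("run", "running", "runs", "runner", "ran"):
    print(form, "->", naive_stem(form))
# run -> run
# running -> run
# runs -> run
# runner -> run
# ran -> ran
```

When the same function is applied to both the indexed text and the search keyword, a search for "run" can match stories that contain "running" or "runner," because they all reduce to the same stem.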
Here's how tokenizing can improve search relevance:
Indexing the phrase "the quick brown fox jumps" as a single string doesn't match a user search for "quick fox."
Tokenizing the phrase separates each word for individual indexing, as follows: "the", "quick", "brown", "fox", and "jumps".
This enables matching with searches like "quick fox," "fox brown," and other variations.
Language-specific analyzers
Language-specific analyzers are tools that help the search engine process text in different languages. They're tailored to each language's grammar and structure to make searches more effective and relevant. For instance, Arc XP offers language analyzers for various languages like Korean, Japanese, Arabic, and more. For a comprehensive list of supported language analyzers provided by Arc XP for editorial search, see Languages supported in Arc XP.
Improving search accuracy with language-based analyzers: Example
It's essential to understand that while Arc XP’s default standard analyzer performs basic text analysis and tokenizing, it may not accurately tokenize non-Latin-based languages like Mandarin Chinese.
Chinese characters, known as logograms, represent words or morphemes, the smallest meaningful units of language. When combined, their meanings can change to represent entirely new words. Take, for instance, the word "volcano" (火山), which is a combination of:
火: fire
山: mountain
A language-specific tokenizer designed for Mandarin can recognize that these two logograms should stay together, because their meaning changes when they are separated.
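The following Python sketch contrasts the two behaviors. It assumes, for illustration only, a generic tokenizer that falls back to splitting character by character, and a language-aware tokenizer modeled as a greedy longest-match lookup against a small vocabulary. Both functions and the vocabulary are hypothetical and far simpler than a real language analyzer.

```python
def per_character_tokens(text: str) -> list[str]:
    """Split text into single characters, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]

def dictionary_tokens(text: str, vocabulary: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a known vocabulary."""
    tokens = []
    max_len = max((len(word) for word in vocabulary), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical vocabulary containing the compound and its two parts.
vocabulary = {"火山", "火", "山"}

print(per_character_tokens("火山"))           # ['火', '山']
print(dictionary_tokens("火山", vocabulary))  # ['火山']
```

With the per-character split, a search for 火山 ("volcano") could also match stories that mention only fire or only mountains. Keeping the compound as one token avoids those unrelated matches.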