Data quality in files. Knowledge base

Automatic text generation uses keywords to search the knowledge base text for segments (fragments) similar to the user’s query, then generates an answer from those segments. The quality of the generated answers therefore depends directly on the quality of the information in the knowledge base, which can be assessed by its content and its structure.

Content

Better results can be achieved when the knowledge base text is:

  • Worded precisely and directly, in language close to the questions users are likely to ask and the answers they expect.
    • People often expect artificial intelligence to infer more than the text states explicitly, but this usually does not happen. For example, if the knowledge base file contains the sentence "Our world: fire, earth, air, water." and the user asks in the chat "Name the four elements!", no answer will be generated. However, if the knowledge base contains the sentence "The world consists of four elements: fire, earth, air, and water," then an answer will be generated.
    • Artificial intelligence is also expected to handle generalized concepts, which it currently does not do effectively. For example, if the knowledge base file contains the sentence "The earth element includes stones and rocks," and the user asks "What element does feldspar belong to?", the AI might just as easily answer about air as about earth, because it knows nothing about feldspar. Nor can artificial intelligence make human-like judgments: if the knowledge base has instructions for getting a new uniform, it will not answer the question "My boots are broken, what should I do?"
    • It is sometimes expected that automatic text generation will evaluate large amounts of data and produce highly precise, meaningful, and specific answers that cover the smallest nuances and reservations. This is not the case. If the knowledge base contains general legal acts and the user's question requires analyzing several documents, there is no guarantee that the generated answer will mention every nuance of the law from all of them.
  • Accompanied by comprehensive descriptive information and explanations.
    • Example: If the knowledge base file contains many tables without titles, or references to images without detailed explanations, questions about the content of a table or image will get no answer, because the automatic text generation system has no textual clues for finding the segments that contain the necessary data. To make the information in images or tables available, give them descriptive text about their content.
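
The effect of wording can be illustrated with a minimal Python sketch. It assumes that similarity is simply the number of keywords shared by the question and a segment; the platform's actual search algorithm is not described here and is likely more sophisticated. The names STOP_WORDS, keywords, and overlap_score are hypothetical.

```python
import re

# Tiny stop-word list, for this illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def keywords(text: str) -> set[str]:
    """Lower-case the text and return its words minus stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def overlap_score(query: str, segment: str) -> int:
    """Count the keywords shared by the query and the segment."""
    return len(keywords(query) & keywords(segment))


query = "Name the four elements!"
implicit = "Our world: fire, earth, air, water."
explicit = "The world consists of four elements: fire, earth, air, and water."

print(overlap_score(query, implicit))  # 0
print(overlap_score(query, explicit))  # 2
```

The implicit sentence shares no keywords with the question, so under this assumption it would never be selected, while the explicit sentence matches on "four" and "elements".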

Structure

Segments. To generate an automatic response to a user’s query, a request is sent to the large language model together with the knowledge base fragments (segments) most similar to the query. The accuracy of the response therefore depends on whether the necessary segment from the knowledge base is included in the request.
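
The dependence on segment selection can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: similarity is approximated by shared words, and the function names, the top_k parameter, and the prompt wording are all hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Return the set of lower-cased words in the text."""
    return set(re.findall(r"\w+", text.lower()))


def select_segments(query: str, segments: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k segments that share the most words with the query."""
    query_words = words(query)
    ranked = sorted(
        segments,
        key=lambda segment: len(query_words & words(segment)),
        reverse=True,
    )
    return ranked[:top_k]


def build_request(query: str, selected: list[str]) -> str:
    """Combine the selected segments and the question into one request text."""
    context = "\n\n".join(selected)
    return f"Answer using only this text:\n\n{context}\n\nQuestion: {query}"


segments = [
    "The world consists of four elements: fire, earth, air, and water.",
    "Our address is Vienības gatve 4a.",
    "The company was founded in 2024.",
]
query = "What is your address?"
request = build_request(query, select_segments(query, segments, top_k=1))
print(request)
```

Only segments returned by select_segments reach the model, so a segment that ranks below the cut-off cannot contribute to the answer, however relevant it is.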

When a new file is added to the knowledge base, it is automatically divided into segments. The segment breakdown can be viewed in the file's Segments tab (also by clicking the magnifying glass icon if the file has been tagged as "public").

A new segment is also started automatically at each heading, that is, a line that begins with hashtags (#, ##, or ###). Even so, conceptually related information sometimes ends up divided across multiple segments.

For instance, the first sentence of a paragraph might be attached to the previous segment while the rest of the paragraph is in the next one. Or, in a longer list, the first items may share a segment with the introductory sentence while the remaining items land in the next segment without context. In other cases, conceptually unrelated information is combined into one segment. When this happens, incorrect or incomplete segments may be included in the request, and automatic generation may produce incorrect answers.

Segment boundaries can be changed manually by inserting a new-segment marker, four hashtags (####), on a new line above the text to be separated. This is also the recommended way to add independent, simple questions and answers to the knowledge base.

For example:

#### 

What is your address?

Our address is Vienības gatve 4a.



####

How old are you?

The company was founded in 2024.
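
The two explicit boundary rules described above (headings marked with #, ##, or ### and the manual #### marker) can be sketched in Python. The platform's automatic splitting follows additional rules that are not described here, so this is only a way to preview where your own markers would cut the text; split_segments is a hypothetical helper.

```python
import re

# One to four '#' characters followed by whitespace or the end of the line.
BOUNDARY = re.compile(r"^#{1,4}(\s|$)")


def split_segments(text: str) -> list[str]:
    """Split text at heading lines and at bare '####' markers."""
    segments: list[str] = []
    current: list[str] = []

    def flush() -> None:
        body = "\n".join(current).strip()
        if body:
            segments.append(body)

    for line in text.splitlines():
        if BOUNDARY.match(line):
            flush()
            # Keep a heading that has text; drop a bare marker line.
            current = [line] if line.strip("# \t") else []
        else:
            current.append(line)
    flush()
    return segments


sample = """####

What is your address?

Our address is Vienības gatve 4a.

####

How old are you?

The company was founded in 2024.
"""

for number, segment in enumerate(split_segments(sample), start=1):
    print(f"--- segment {number} ---")
    print(segment)
```

Each question stays in the same segment as its answer, which is the point of placing the marker above every question.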

Formatting the text. To make the data easier to navigate and process (especially tables), it is recommended to format it using markdown syntax. Some examples can be found in the article "Formatting virtual assistant's outputs".
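
As a small illustration of markdown formatting, the sketch below renders rows of data as a markdown table under a heading that serves as its title. The helper name to_markdown_table and the title text are assumptions made for the example; the rows reuse the questions and answers shown earlier.

```python
def to_markdown_table(title: str, headers: list[str], rows: list[tuple]) -> str:
    """Render a titled markdown table; '|' inside cells is escaped."""

    def format_row(cells) -> str:
        escaped = (str(cell).replace("|", "\\|") for cell in cells)
        return "| " + " | ".join(escaped) + " |"

    lines = [f"### {title}", ""]
    lines.append(format_row(headers))
    lines.append(format_row("---" for _ in headers))
    lines.extend(format_row(row) for row in rows)
    return "\n".join(lines)


table = to_markdown_table(
    "Company questions and answers",  # hypothetical title
    ["Question", "Answer"],
    [
        ("What is your address?", "Our address is Vienības gatve 4a."),
        ("How old are you?", "The company was founded in 2024."),
    ],
)
print(table)
```

Because the title is written as a heading, it also gives the search some text to match when a user asks about the table's content.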

Recommendations for working with data

  1. Adding a data file:

You can add files in various formats (Word, PDF, etc.) to the knowledge base. The content will be stored on the site as a markdown file. Before uploading a file, you can try converting its content to markdown yourself with an online tool or plugin; this may produce a more complete result that is better suited to the specific file.
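
One way to try such a conversion locally is the pandoc command-line tool, sketched here from Python. This assumes pandoc is installed; it reads Word (.docx) files but not PDF, and the file name handbook.docx is a placeholder.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source: Path) -> list[str]:
    """Build a pandoc call that writes GitHub-flavored markdown next to the source."""
    target = source.with_suffix(".md")
    return ["pandoc", str(source), "-t", "gfm", "-o", str(target)]


def convert_to_markdown(source: Path) -> Path:
    """Run pandoc on the source file and return the path of the markdown file."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source), check=True)
    return source.with_suffix(".md")


# Show the command without running it; the file name is a placeholder.
print(pandoc_command(Path("handbook.docx")))
```

The converted file still needs to be checked by hand before it is uploaded.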

  2. Reviewing:

Initial data may vary in quality, and no conversion tool produces perfect results, especially if the file contains additional information such as tables, images, and charts. The content of the file must therefore be reviewed to ensure that everything necessary has been uploaded, converted correctly, and kept in the final data.
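
Part of this review can be automated. The sketch below is a rough heuristic, not a feature of the platform: it flags images whose text is empty or generic (such as "Image 1") and tables whose closest preceding line is neither a heading nor a sentence ending in a colon. The function name, the rules, and the sample text are all made up for the example.

```python
import re

IMAGE = re.compile(r"!\[(.*?)\]\((.*?)\)")
GENERIC_ALT = re.compile(r"^(image|img|picture|figure)?\s*\d*$", re.IGNORECASE)


def is_table_row(line: str) -> bool:
    return line.lstrip().startswith("|")


def review_markdown(text: str) -> list[tuple[int, str]]:
    """Return (line number, problem) pairs for images and tables lacking description."""
    problems = []
    lines = text.splitlines()
    for index, line in enumerate(lines):
        for alt, _url in IMAGE.findall(line):
            if GENERIC_ALT.match(alt.strip()):
                problems.append((index + 1, "image has no descriptive text"))
        starts_table = is_table_row(line) and (
            index == 0 or not is_table_row(lines[index - 1])
        )
        if starts_table:
            # Closest non-blank line above the table, if any.
            above = next((l.strip() for l in reversed(lines[:index]) if l.strip()), "")
            if not (above.startswith("#") or above.endswith(":")):
                problems.append((index + 1, "table has no title or introduction"))
    return problems


# Made-up sample text for the illustration.
sample = """# Menu

![Image 1](https://example.com/cone.png)

| Flavour | Price |
| --- | --- |
| Vanilla | 2 |

Prices per scoop:

| Flavour | Price |
| --- | --- |
| Vanilla | 2 |
"""

for line_number, problem in review_markdown(sample):
    print(line_number, problem)
```

A check like this only finds candidates for review; whether a description is actually adequate still has to be judged by a person.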

  3. Image processing:

If the original file contains an image that is not available on the web, it can be added to the site's Resources subview. You can then reference the uploaded image in the knowledge base file by its link (right-click the image and select "Copy image address").

Example:

![Image 1](https://va.tilde.com/api/prodk8sboticecr0/media/staging/icecream4985161_1280.png)

  4. Testing:

After adding and organizing a new file, it is worth testing by asking questions about the document's topic. This way, you can determine if the answers provided are correct and review whether segment adjustments are needed.
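
Such testing can be repeated consistently with a small script. The sketch below assumes you have some way to send a question to the chatbot and get the answer back as text; that is represented by the ask parameter, and fake_ask is a stand-in that returns a canned reply. No real API of the platform is used or implied.

```python
from typing import Callable


def run_checks(ask: Callable[[str], str], cases: dict[str, list[str]]) -> list[str]:
    """Ask each question and report expected keywords missing from the answer."""
    failures = []
    for question, expected in cases.items():
        answer = ask(question).lower()
        missing = [keyword for keyword in expected if keyword.lower() not in answer]
        if missing:
            failures.append(f"{question!r}: missing {missing}")
    return failures


def fake_ask(question: str) -> str:
    """Stand-in for the chatbot; replace with a real call in practice."""
    canned = {"What is your address?": "Our address is Vienības gatve 4a."}
    return canned.get(question, "")


cases = {
    "What is your address?": ["Vienības gatve 4a"],
    "How old are you?": ["2024"],
}

for failure in run_checks(fake_ask, cases):
    print(failure)
```

Because generated responses can differ from one request to the next, it is worth running the same questions several times rather than relying on a single pass.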

Maintenance

  • It is recommended to keep only current data in the knowledge base and delete outdated and irrelevant information.
  • Data that changes frequently should be kept in a separate file so that it is easy to find and update regularly.

Important!

Responses generated by automatic text generation can vary each time and may be inaccurate or incomplete.


Read more

  • Knowledge test, prompt
  • The training view “Knowledge base”
  • Function "generateanswer"
  • Automatic text generation
  • Good Document for Knowledge Base