Managing Datasources
All your datasources can be viewed in the Datasources tab that can be accessed from the side menu.
You can either query your datasource or manage its settings using the corresponding buttons.
The details page has the following menus:
View and edit basic information like Name, Description. You can also download your data as a .csv file for "Spreadsheet" datasources.
Control the specific tables and columns you want to query. Select the exact tables and columns that you want to use for querying data. This setting lets you manage data access control and ensures that only the data relevant for querying is exposed.
You can view all the tables in a datasource from the side menu. You can add more information at the table level to make the schema richer. You can add the following information:
Information : To help LLMate provide accurate and insightful answers, it's important to perform a one-time setup of descriptions for every table and column in your database. These descriptions give LLMate the context it needs to understand and query your data effectively. By taking the time to provide clear, thoughtful, and comprehensive descriptions for each table and column, you create a strong foundation for smooth, accurate, and efficient natural language (NL) queries in the future.
Concepts : To ensure the AI recognizes and handles specific, custom or complex calculations accurately, follow the structure below to define a Concept.
Joins are a powerful tool in relational databases that allow you to combine data from multiple tables based on related columns. They enable you to create comprehensive views of your data, merging information from different sources to answer complex business questions. Whether you're enriching customer data, analysing transaction records, or connecting sales data with inventory details, joins allow you to query across different datasets efficiently.
Power of Joins:
Combine Data: By linking two or more tables, you can combine datasets that would otherwise remain isolated.
Enhance Insights: Joining tables allows you to enrich your analysis, connecting customer profiles with orders or linking sales with product inventories.
Optimize Queries: Instead of duplicating data across tables, joins let you maintain normalized datasets and still query them as if they were one.
Follow these steps to set up joins between tables in your database for comprehensive data analysis:
Selected Table
Description: Begin by selecting the base table that contains the core data you want to combine with another table. This table serves as the starting point of the join operation.
Select the Column from the Selected Table to Join with Another
Description: Choose the column from the selected table that will be used to establish the relationship with the other table. This column should contain values that match or relate to a column in the second table.
Connect with (Join Type)
Select the appropriate join type based on how you want to combine the data:
inner_join: Returns only the rows with matching values in both tables.
left_outer_join: Returns all rows from the left table and matching rows from the right table, with NULL for non-matching rows.
right_outer_join: Returns all rows from the right table and matching rows from the left table, with NULL for non-matching rows.
self_join: Joins the table to itself.
cross_join: Returns the Cartesian product, combining each row from the first table with each row from the second.
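The difference between the most common join types can be seen with two toy tables — a minimal sketch using an in-memory SQLite database (the table and column names here are illustrative, not part of LLMate):

```python
import sqlite3

# Two toy tables: only campaign 1 generated a lead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaigns (campaign_id INTEGER, name TEXT);
    CREATE TABLE leads (lead_id INTEGER, campaign_id INTEGER);
    INSERT INTO campaigns VALUES (1, 'Spring Sale'), (2, 'Webinar');
    INSERT INTO leads VALUES (10, 1);
""")

# inner_join: only rows with matching values in both tables.
inner = conn.execute("""
    SELECT c.name, l.lead_id
    FROM campaigns c
    INNER JOIN leads l ON c.campaign_id = l.campaign_id
""").fetchall()
print(inner)  # [('Spring Sale', 10)]

# left_outer_join: all rows from the left table; NULL (None in Python)
# where the right table has no match.
left = conn.execute("""
    SELECT c.name, l.lead_id
    FROM campaigns c
    LEFT OUTER JOIN leads l ON c.campaign_id = l.campaign_id
""").fetchall()
print(left)  # [('Spring Sale', 10), ('Webinar', None)]
```

The 'Webinar' campaign survives the left outer join with a NULL lead, but disappears entirely from the inner join.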
Select the Table You Want to Join With
Description: Next, select the second table that contains the related data you want to join with your base table. This table should hold the data that will complement or expand the base table.
Select the Column from the Selected Table You Want to Join With
Description: Choose the column from the second table that matches the column you selected in your base table. The data in these columns should have a logical relationship (e.g., a shared user_id or product_id).
Example Use Case with Multiple Joins:
First Join:
Selected Table: marketing_campaigns
Selected Column: campaign_id
Join Type: left_outer_join
Table to Join With: sales_leads
Selected Column from Target Table: campaign_id
This join links the marketing campaigns to the sales leads they generated, retrieving all campaign data, including campaigns that didn’t generate leads.
Second Join:
Selected Table: sales_leads
Selected Column: lead_id
Join Type: inner_join
Table to Join With: sales_transactions
Selected Column from Target Table: lead_id
After performing the first join, you may decide to add another join to further enrich the data by linking the sales_leads table with the sales_transactions table, using the lead_id column as the common field. This enables you to analyze the journey from campaign initiation, through lead generation, to final sales transactions.
All structured datasources have a small brain that helps you analyse data using queries in natural English. This brain is responsible for understanding your query, converting it into an SQL query, executing it, and retrieving the required data frame.
The brain is powered by a set of LLMs that you can configure and tweak based on your requirements and performance expectations.
You can configure the LLMs for the following tasks:
SQL Generation : Configure your LLM to generate correct SQL queries
SQL Re-generation : Configure your LLM to attempt re-generation of SQL queries in case of errors or failure of SQL generation LLM
SQL Summarisation : Configure your LLM to summarise the SQL generation process
All of the above tasks have the following settings that can be tweaked:
OpenAI: Uses OpenAI’s models like GPT-4 for SQL generation and other tasks.
Fine-tuned: Allows you to use a fine-tuned version of a language model specific to your dataset or use case.
Vertex AI: Google’s Vertex AI platform that allows for scalable model training and deployment.
Claude AI: Developed by Anthropic, this is another powerful LLM for natural language understanding and processing.
Azure OpenAI: Uses Microsoft’s Azure platform to access OpenAI models integrated with Azure’s enterprise-grade infrastructure.
Choose the specific language model. Each model might vary in size, speed, and capabilities.
Hyperparameter Tuning
Temperature:
What it Does: Controls the randomness of the AI’s output. Lower values make the output more focused and deterministic, while higher values make it more creative and diverse.
Example: Setting it to 0 will make the model output more predictable and fact-based.
Top P:
What it Does: Known as "nucleus sampling," Top P controls the diversity of the output by choosing from the smallest possible set of words whose probabilities add up to P.
Example: A Top P of 0.9 means the model samples only from the smallest set of tokens whose cumulative probability reaches 90%, adding some variability to the output.
Frequency Penalty:
What it Does: Penalizes the model for using the same words too frequently, promoting more diverse wording in the output.
Example: A higher value discourages repetitive words or phrases in the generated SQL query.
Presence Penalty:
What it Does: Similar to the frequency penalty, but it applies to the presence of specific terms in the output, making the AI less likely to repeat the same concepts.
Example: Setting a higher presence penalty will push the model to explore new topics or avoid repeating the same logic.
Max Output Tokens:
What it Does: Defines the maximum number of tokens (or word pieces) that the model can generate. This limits the length of the SQL query or the response.
Example: A value of 1000 means the output is capped at 1000 tokens, which is useful for controlling how verbose or concise the output is.
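Taken together, these settings form a configuration payload along the following lines — a hypothetical sketch whose field names are illustrative (the exact keys your provider or the LLMate UI uses may differ):

```python
# Hypothetical SQL Generation settings; field names are illustrative.
sql_generation_config = {
    "provider": "OpenAI",       # or Fine-tuned, Vertex AI, Claude AI, Azure OpenAI
    "model": "gpt-4",
    "temperature": 0,           # deterministic, fact-based SQL output
    "top_p": 0.9,               # nucleus sampling over 90% cumulative probability
    "frequency_penalty": 0.2,   # discourage repeated tokens
    "presence_penalty": 0.1,    # discourage repeated concepts
    "max_output_tokens": 1000,  # cap response length at 1000 tokens
}
print(sql_generation_config["model"])
```

A temperature of 0 with a moderate top_p is a common starting point for SQL generation, where determinism matters more than creative variety.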
Since LLMate is built on RAG (retrieval-augmented generation), the system pulls relevant SQL examples from a managed set, which the AI model then uses to generate a final SQL query based on the user's request. The 'Example to pick' setting lets users specify the number of similar examples to send to the model; this option improves the relevance and accuracy of the generated query.
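The retrieval step can be sketched as follows — a minimal stand-in where a toy token-overlap score replaces the real embedding similarity, and the example store is invented for illustration:

```python
# Toy similarity: Jaccard overlap of lowercase tokens (a stand-in for
# embedding-based similarity used by real RAG systems).
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# 'Example to pick' = k: how many of the most similar stored
# query-SQL pairs to send to the model alongside the user's request.
def pick_examples(user_query, examples, k=2):
    ranked = sorted(examples,
                    key=lambda ex: similarity(user_query, ex["query"]),
                    reverse=True)
    return ranked[:k]

examples = [  # hypothetical managed example set
    {"query": "total revenue last month", "sql": "SELECT SUM(amount) ..."},
    {"query": "users signed up last week", "sql": "SELECT * FROM users ..."},
    {"query": "revenue by campaign", "sql": "SELECT campaign_id, SUM(amount) ..."},
]
picked = pick_examples("what is total revenue by campaign", examples, k=2)
print([ex["query"] for ex in picked])
```

The two closest examples are attached to the prompt, giving the model concrete patterns to imitate when writing the final SQL.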
The System Instruction box contains the core logic or detailed guidance for the AI, describing how it should behave and what steps to follow when generating SQL queries. This instruction is not visible to the user but guides the AI in the background.
table_info: This variable contains the complete database schema, including the table and column descriptions filled in during setup. It allows the AI to understand the available tables and columns when generating the correct SQL query.
System Instruction Content:
Objective: The system instruction defines the AI's role, which in this case is to generate syntactically correct BigQuery SQL queries.
Table Info: The {table_info} variable is a placeholder that is dynamically replaced with the actual table schema or structure. It provides the AI with the necessary context about the database schema.
General Guidelines: Specific rules are set for how the AI should format its response:
Always respond in markdown format.
SQL code should be enclosed in triple backticks with the language specified as sql for correct formatting.
If the required columns to answer the query are not present, the AI is instructed not to generate any SQL.
The Human Instruction box represents what the user is asking the AI to do. This prompt acts as the user’s direct request for generating SQL based on a specific query.
user_query: This variable represents the user’s input or query. It dynamically captures the user's request and is used to instruct the AI to generate the appropriate SQL query based on the input.
Human Instruction Content:
Objective: It instructs the AI to generate SQL based on the user's query ({user_query}).
Dynamic Query Input: The {user_query} variable is a placeholder for the user's question, such as "What is the total revenue for last month?" or "List all users who signed up in the last week."
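The way the two placeholders are filled before the request reaches the model can be sketched like this — the template wording and schema below are illustrative, not LLMate's actual instructions:

```python
# Illustrative templates; real System/Human Instructions are configured
# in the LLMate UI and are more detailed than this sketch.
SYSTEM_TEMPLATE = (
    "You generate syntactically correct BigQuery SQL.\n"
    "Schema:\n{table_info}\n"
    "Always respond in markdown and enclose SQL in triple backticks "
    "with the language specified as sql.\n"
    "If the required columns are not present, do not generate SQL."
)
HUMAN_TEMPLATE = "Generate SQL for: {user_query}"

table_info = "orders(order_id INT, amount FLOAT, created_at DATE)"  # hypothetical schema
user_query = "What is the total revenue for last month?"

messages = [
    {"role": "system", "content": SYSTEM_TEMPLATE.format(table_info=table_info)},
    {"role": "user", "content": HUMAN_TEMPLATE.format(user_query=user_query)},
]
print(messages[1]["content"])
```

The system message carries the schema and formatting rules invisibly in the background, while the human message carries only the user's question.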
Out of the box, LLMate has a baseline performance that varies with the complexity of your schema. To ensure production-ready performance, you can provide LLMate with examples of correct query-SQL pairs.
Within a datasource’s “Manage Examples” tab, you can manage, create, and approve query-SQL pairs.
1. Ask a Query
If you’re using LLMate for the first time, you can ask a business query by clicking the “Query” button. When you ask a question, LLMate will generate SQL and display its output on the "Dataframe" tab. You can also click into previously asked questions from Logs.
2. Validate the SQL
Ensure from the "SQL Response" tab that the generated SQL is correct and returns the correct data. You can also view the steps LLMate used to create the SQL. If everything looks good, move to step 4. If the SQL is incorrect, move to step 3.
3. Edit the SQL
If the generated SQL is incorrect, you can edit it directly in the code editor on the "Manage Response" tab. You can then re-run the query to validate the new data frame with the updated data.
4. Save the Example
Once you click the "Save Example" button, LLMate immediately starts learning from the query-SQL pair. You can always come back and edit the business query or its SQL, or remove the pair from the model, from the "Manage Examples" tab. The more validated pairs you add to your model, the more performant your model will be.
Once you have saved an example it will be added to your Examples list that can be viewed from the "Manage Examples" tab.
The Logs tab allows you to view all your datasource runs. You can inspect each run's log, validate it, and save it as an example.
You can also use the option to generate these descriptions using an LLM of your choice.