Managing Datasources
All your datasources can be viewed in the Datasources tab that can be accessed from the side menu.
You can either query your datasource or manage its settings using the corresponding buttons.
The details page has the following menus:
View and edit basic information like Name, Description. You can also download your data as a .csv file for "Spreadsheet" datasources.
Control the specific tables and columns you want to query. Select the exact tables and columns that you want to use for querying data. This setting lets you manage data access control and ensures that only the data relevant for querying is exposed.
You can view all the tables in a datasource from the side menu. You can add more information at the table level to make the schema richer. You can add the following information:
Information : To help LLMate provide accurate and insightful answers, it's important to perform a one-time setup of descriptions for every table and column in your database. These descriptions give LLMate the context it needs to understand and query your data effectively. By taking the time to provide clear, thoughtful, and comprehensive descriptions for each table and column, you create a strong foundation for smooth, accurate, and efficient natural language (NL) queries in the future.
Concepts : To ensure the AI recognizes and handles specific, custom or complex calculations accurately, follow the structure below to define a Concept.
Joins are a powerful tool in relational databases that allow you to combine data from multiple tables based on related columns. They enable you to create comprehensive views of your data, merging information from different sources to answer complex business questions. Whether you're enriching customer data, analysing transaction records, or connecting sales data with inventory details, joins allow you to query across different datasets efficiently.
Power of Joins:
Combine Data: By linking two or more tables, you can combine datasets that would otherwise remain isolated.
Enhance Insights: Joining tables allows you to enrich your analysis, connecting customer profiles with orders or linking sales with product inventories.
Optimize Queries: Instead of duplicating data across tables, joins let you maintain normalized datasets and still query them as if they were one.
Follow these steps to set up joins between tables in your database for comprehensive data analysis:
Selected Table
Description: Begin by selecting the base table that contains the core data you want to combine with another table. This table serves as the starting point of the join operation.
Select the Column from the Selected Table to Join with Another
Description: Choose the column from the selected table that will be used to establish the relationship with the other table. This column should contain values that match or relate to a column in the second table.
Connect with (Join Type)
Select the appropriate join type based on how you want to combine the data:
inner_join: Returns only the rows with matching values in both tables.
left_outer_join: Returns all rows from the left table and matching rows from the right table, with NULL for non-matching rows.
right_outer_join: Returns all rows from the right table and matching rows from the left table, with NULL for non-matching rows.
self_join: Joins the table to itself.
cross_join: Returns the Cartesian product, combining each row from the first table with each row from the second.
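The difference between the most common join types can be seen with two toy tables — a minimal sketch using an in-memory SQLite database (the table and column names here are illustrative, not part of LLMate):

```python
import sqlite3

# Two toy tables: only campaign 1 generated a lead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaigns (campaign_id INTEGER, name TEXT);
    CREATE TABLE leads (lead_id INTEGER, campaign_id INTEGER);
    INSERT INTO campaigns VALUES (1, 'Spring Sale'), (2, 'Webinar');
    INSERT INTO leads VALUES (10, 1);
""")

# inner_join: only rows with matching values in both tables.
inner = conn.execute("""
    SELECT c.name, l.lead_id
    FROM campaigns c
    INNER JOIN leads l ON c.campaign_id = l.campaign_id
""").fetchall()
print(inner)  # [('Spring Sale', 10)]

# left_outer_join: all rows from the left table; NULL (None in Python)
# where the right table has no match.
left = conn.execute("""
    SELECT c.name, l.lead_id
    FROM campaigns c
    LEFT OUTER JOIN leads l ON c.campaign_id = l.campaign_id
""").fetchall()
print(left)  # [('Spring Sale', 10), ('Webinar', None)]
```

The 'Webinar' campaign survives the left outer join with a NULL lead, but disappears entirely from the inner join.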
Select the Table You Want to Join With
Description: Next, select the second table that contains the related data you want to join with your base table. This table should hold the data that will complement or expand the base table.
Select the Column from the Selected Table You Want to Join With
Description: Choose the column from the second table that matches the column you selected in your base table. The data in these columns should have a logical relationship (e.g., a shared user_id or product_id).
Example Use Case with Multiple Joins:
First Join:
Selected Table: marketing_campaigns
Selected Column: campaign_id
Join Type: left_outer_join
Table to Join With: sales_leads
Selected Column from Target Table: campaign_id
This join links the marketing campaigns to the sales leads they generated, retrieving all campaign data, including campaigns that didn’t generate leads.
Second Join:
Selected Table: sales_leads
Selected Column: lead_id
Join Type: inner_join
Table to Join With: sales_transactions
Selected Column from Target Table: lead_id
After performing the first join, you may decide to add another join to further enrich the data by linking the sales_leads table with the sales_transactions table, using the lead_id column as the common field. This enables you to analyze the journey from campaign initiation, through lead generation, to final sales transactions.
All structured datasources have a small brain that helps you analyse data using queries in natural English. This brain is responsible for understanding your query, converting it into an SQL query, executing it, and retrieving the required data frame.
The brain is powered by a set of LLMs that you can configure and tweak based on your requirements and performance expectations.
You can configure the LLMs for the following tasks:
SQL Generation : Configure your LLM to generate correct SQL queries
SQL Re-generation : Configure your LLM to attempt re-generation of SQL queries in case of errors or failure of SQL generation LLM
SQL Summarisation : Configure your LLM to summarise the SQL generation process
All of the above tasks have the following settings that can be tweaked:
OpenAI: Uses OpenAI’s models like GPT-4 for SQL generation and other tasks.
Fine-tuned: Allows you to use a fine-tuned version of a language model specific to your dataset or use case.
Vertex AI: Google’s Vertex AI platform that allows for scalable model training and deployment.
Claude AI: Developed by Anthropic, this is another powerful LLM for natural language understanding and processing.
Azure OpenAI: Uses Microsoft’s Azure platform to access OpenAI models integrated with Azure’s enterprise-grade infrastructure.
Choose the specific language model. Each model might vary in size, speed, and capabilities.
Hyperparameter Tuning
Temperature:
What it Does: Controls the randomness of the AI’s output. Lower values make the output more focused and deterministic, while higher values make it more creative and diverse.
Example: Setting it to 0 will make the model output more predictable and fact-based.
Top P:
What it Does: Known as "nucleus sampling," Top P controls the diversity of the output by choosing from the smallest possible set of words whose probabilities add up to P.
Example: A Top P of 0.9 means the model samples only from the smallest set of tokens whose cumulative probability reaches 90%, adding some variability to the output.
Frequency Penalty:
What it Does: Penalizes the model for using the same words too frequently, promoting more diverse wording in the output.
Example: A higher value discourages repetitive words or phrases in the generated SQL query.
Presence Penalty:
What it Does: Similar to the frequency penalty, but it applies to the presence of specific terms in the output, making the AI less likely to repeat the same concepts.
Example: Setting a higher presence penalty will push the model to explore new topics or avoid repeating the same logic.
Max Output Tokens:
What it Does: Defines the maximum number of tokens (or word pieces) that the model can generate. This limits the length of the SQL query or the response.
Example: A value of 1000 means the output is capped at 1000 tokens, which is useful for controlling how verbose or concise the output is.
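Taken together, these settings form a configuration payload along the following lines — a hypothetical sketch whose field names are illustrative (the exact keys your provider or the LLMate UI uses may differ):

```python
# Hypothetical SQL Generation settings; field names are illustrative.
sql_generation_config = {
    "provider": "OpenAI",       # or Fine-tuned, Vertex AI, Claude AI, Azure OpenAI
    "model": "gpt-4",
    "temperature": 0,           # deterministic, fact-based SQL output
    "top_p": 0.9,               # nucleus sampling over 90% cumulative probability
    "frequency_penalty": 0.2,   # discourage repeated tokens
    "presence_penalty": 0.1,    # discourage repeated concepts
    "max_output_tokens": 1000,  # cap response length at 1000 tokens
}
print(sql_generation_config["model"])
```

A temperature of 0 with a moderate top_p is a common starting point for SQL generation, where determinism matters more than creative variety.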
Since LLMate is built on RAG (retrieval-augmented generation), the system pulls relevant SQL examples from a managed set, which the AI model then uses to generate a final SQL query based on the user's request. The 'Example to pick' setting lets users specify the number of similar examples to send to the model; this option improves the relevance and accuracy of the generated query.
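The retrieval step can be sketched as follows — a minimal stand-in where a toy token-overlap score replaces the real embedding similarity, and the example store is invented for illustration:

```python
# Toy similarity: Jaccard overlap of lowercase tokens (a stand-in for
# embedding-based similarity used by real RAG systems).
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# 'Example to pick' = k: how many of the most similar stored
# query-SQL pairs to send to the model alongside the user's request.
def pick_examples(user_query, examples, k=2):
    ranked = sorted(examples,
                    key=lambda ex: similarity(user_query, ex["query"]),
                    reverse=True)
    return ranked[:k]

examples = [  # hypothetical managed example set
    {"query": "total revenue last month", "sql": "SELECT SUM(amount) ..."},
    {"query": "users signed up last week", "sql": "SELECT * FROM users ..."},
    {"query": "revenue by campaign", "sql": "SELECT campaign_id, SUM(amount) ..."},
]
picked = pick_examples("what is total revenue by campaign", examples, k=2)
print([ex["query"] for ex in picked])
```

The two closest examples are attached to the prompt, giving the model concrete patterns to imitate when writing the final SQL.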
The System Instruction box contains the core logic or detailed guidance for the AI, describing how it should behave and what steps to follow when generating SQL queries. This instruction is not visible to the user but guides the AI in the background.
table_info: This variable contains the complete database schema, including the table and column descriptions filled in during setup. It allows the AI to understand the available tables and columns when generating the correct SQL query.
System Instruction Content:
Objective: The system instruction defines the AI's role, which in this case is to generate syntactically correct BigQuery SQL queries.
Table Info: The {table_info} variable is a placeholder that is dynamically replaced with the actual table schema or structure. It provides the AI with the necessary context about the database schema.
General Guidelines: Specific rules are set for how the AI should format its response:
Always respond in markdown format.
SQL code should be enclosed in triple backticks with the language specified as sql for correct formatting.
If the required columns to answer the query are not present, the AI is instructed not to generate any SQL.
The Human Instruction box represents what the user is asking the AI to do. This prompt acts as the user’s direct request for generating SQL based on a specific query.
user_query: This variable represents the user’s input or query. It dynamically captures the user's request and is used to instruct the AI to generate the appropriate SQL query based on the input.
Human Instruction Content:
Objective: It instructs the AI to generate SQL based on the user's query ({user_query}).
Dynamic Query Input: The {user_query} variable is a placeholder for the user's question, such as "What is the total revenue for last month?" or "List all users who signed up in the last week."
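The way the two placeholders are filled before the request reaches the model can be sketched like this — the template wording and schema below are illustrative, not LLMate's actual instructions:

```python
# Illustrative templates; real System/Human Instructions are configured
# in the LLMate UI and are more detailed than this sketch.
SYSTEM_TEMPLATE = (
    "You generate syntactically correct BigQuery SQL.\n"
    "Schema:\n{table_info}\n"
    "Always respond in markdown and enclose SQL in triple backticks "
    "with the language specified as sql.\n"
    "If the required columns are not present, do not generate SQL."
)
HUMAN_TEMPLATE = "Generate SQL for: {user_query}"

table_info = "orders(order_id INT, amount FLOAT, created_at DATE)"  # hypothetical schema
user_query = "What is the total revenue for last month?"

messages = [
    {"role": "system", "content": SYSTEM_TEMPLATE.format(table_info=table_info)},
    {"role": "user", "content": HUMAN_TEMPLATE.format(user_query=user_query)},
]
print(messages[1]["content"])
```

The system message carries the schema and formatting rules invisibly in the background, while the human message carries only the user's question.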
Out of the box, LLMate has a baseline performance that varies with the complexity of your schema. To ensure production-ready performance, you can provide LLMate with examples of correct query-SQL pairs.
Within a datasource’s “Manage Examples” tab, you can manage, create, and approve query-SQL pairs.
1. Ask a Query
If you’re using LLMate for the first time, you can ask a business query by clicking the “Query” button. When you ask a question, LLMate will generate SQL and display its output on the "Dataframe" tab. You can also click into previously asked questions from Logs.
2. Validate the SQL
Ensure from the "SQL Response" tab that the generated SQL is correct and returns the correct data. You can also view the steps LLMate used to create the SQL. If everything looks good, move to step 4. If the SQL is incorrect, move to step 3.
3. Edit the SQL
If the generated SQL is incorrect, you can edit it directly in the code editor on the "Manage Response" tab. You can then re-run the query to validate the new data frame with the updated data.
4. Save the Example
Once you click the "Save Example" button, LLMate immediately starts learning from the query-SQL pair. You can always come back and edit the business query or its SQL, or remove the pair from the model, from the "Manage Examples" tab. The more validated pairs you add to your model, the more performant your model will be.
Once you have saved an example it will be added to your Examples list that can be viewed from the "Manage Examples" tab.
The Logs tab allows you to view all your datasource runs. You can inspect each run's log, validate it, and save it as an example.
You can also use the option to generate these descriptions using an LLM of your choice.