BI Insights Inc
  • 134
  • 1 064 658
NLP & AI database integration | Get Insights from database using NLP | Chat with database | AI | NLP
Get ready for some exciting tech🚀!
In this video we're building a Streamlit app that helps you get insights from a SQL database using natural language processing (NLP)! Imagine being able to ask questions like "What's the total sales?" or "Which products are most popular?" and getting instant answers from your SQL database, all without leaving your local machine!
We are using Llama 3, an open-source Large Language Model (LLM) that runs locally on our machine. This means we keep our data safe and secure within our own network. Here is how to set up Ollama and OpenWebUI for the local LLM:
Ollama: ua-cam.com/video/CE9umy2NlhE/v-deo.html
OpenWebUI: ua-cam.com/video/YUYZd71hg3w/v-deo.html
Link to AI Playlist: hnawaz007.github.io/ai.html
DBT series for database development: hnawaz007.github.io/mds.html
How to install Postgres & restore sample database: ua-cam.com/video/fjYiWXHI7Mo/v-deo.html
Follow the step-by-step guide on how to build this app and unlock the power of natural language insights from your data.
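To give you a feel for the core pattern before you dive in, here is a minimal sketch, assuming a local Ollama server with the llama3 model pulled and a Postgres connection string of your own (the prompt wording, credentials, and table names are placeholders, not the exact code from the repo):

```python
# Minimal sketch: the first chain turns a question into SQL, then the generated
# SQL is run against the database. Connection string and prompt are placeholders.
from langchain_community.llms import Ollama
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

db = SQLDatabase.from_uri("postgresql://etl:password@localhost:5432/AdventureWorks")
llm = Ollama(model="llama3")  # served by the local Ollama instance

# Chain 1: natural-language question -> SQL query.
sql_prompt = ChatPromptTemplate.from_template(
    "Given the schema below, write a SQL query that answers the question. "
    "Return only the SQL.\n\nSchema:\n{schema}\n\nQuestion: {question}\nSQL:"
)
sql_chain = sql_prompt | llm | StrOutputParser()

question = "What's the total sales?"
query = sql_chain.invoke({"schema": db.get_table_info(), "question": question})

# Chain 2 (database interface): execute the generated SQL and show the result.
result = db.run(query)
print(query)
print(result)
```

In the video the same idea is wrapped in a Streamlit front end so you can type questions into a chat-style UI.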
Link to GitHub repo: github.com/hnawaz007/pythondataanalysis/tree/main/NLP%20%26%20AI%20database%20integration
#ai #chatwithdatabase #opensourceai
Link to Channel's site:
hnawaz007.github.io/
--------------------------------------------------------------
💥Subscribe to our channel:
ua-cam.com/users/HaqNawaz
📌 Links
-----------------------------------------
#️⃣ Follow me on social media! #️⃣
🔗 GitHub: github.com/hnawaz007
📸 Instagram: bi_insights_inc
📝 LinkedIn: www.linkedin.com/in/haq-nawaz/
🔗 medium.com/@hnawaz100
🚀 hnawaz007.github.io/
-----------------------------------------
Topics in this video (click to jump around):
==================================
0:00 - Overview of the App
1:18 - Custom LLM for SQL
2:40 - Develop LangChain Chain
3:52 - First Chain to Generate SQL
3:53 - Second Chain database interface
5:55 - Streamlit App
6:46 - Test the NLP and AI database integration
8:01 - Use Cases for this App
Views: 597

Videos

How to integrate api data using python & dlt | API | Data Load Tool | ETL | Python | Postgres
Views: 573 · 21 days ago
🚀 Extract, Transform, Load (ETL) like a pro using dlt! In this video we load data from an Exchange Rate API into our Postgres database using Python and dlt! We extract the data from the API, transform it to fit our schema, and load it into our database using a dlt pipeline. We can get the data ready in minutes for analysis and visualization. The Python and dlt combination makes this process a breeze...
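As a rough sketch of the pattern (the actual API endpoint, fields, and pipeline names in the video may differ; the URL below is a placeholder):

```python
# Hedged sketch of an API-to-Postgres dlt pipeline. The endpoint is a placeholder;
# Postgres credentials are expected in .dlt/secrets.toml or environment variables.
import dlt
import requests

@dlt.resource(name="exchange_rates", write_disposition="replace")
def exchange_rates():
    # Extract: pull the latest rates from the API.
    response = requests.get("https://api.example.com/latest?base=USD")
    response.raise_for_status()
    payload = response.json()
    # Transform: reshape into one row per currency.
    for currency, rate in payload.get("rates", {}).items():
        yield {"base": payload.get("base"), "currency": currency, "rate": rate}

# Load: dlt infers the schema and creates the table in Postgres.
pipeline = dlt.pipeline(
    pipeline_name="exchange_rate_api",
    destination="postgres",
    dataset_name="rates",
)
print(pipeline.run(exchange_rates()))
```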
We put Pandas, Polars, and DuckDB to the test! Which comes out on top? | Data Tools
Views: 474 · a month ago
Data Speed Showdown! We put Pandas, Polars, and DuckDB to the test: which one comes out on top for data read and transformation speed? The results are in: DuckDB takes the lead, with Polars a close second, and Pandas trailing slightly behind. So, what's the takeaway? If you're working with smaller datasets and prioritize ease of use, Pandas is still an excellent choice. For larger datasets in t...
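If you want to reproduce the comparison yourself, a rough timing harness along these lines works (the file and column names are placeholders, and the numbers depend entirely on your data and hardware):

```python
# Rough timing sketch for the same read + group-by workload in all three tools.
import time
import duckdb
import pandas as pd
import polars as pl

path = "sales.csv"  # placeholder dataset

start = time.perf_counter()
pd.read_csv(path).groupby("product")["amount"].sum()
print("pandas:", time.perf_counter() - start)

start = time.perf_counter()
pl.read_csv(path).group_by("product").agg(pl.col("amount").sum())
print("polars:", time.perf_counter() - start)

start = time.perf_counter()
duckdb.sql(f"SELECT product, SUM(amount) FROM '{path}' GROUP BY product").fetchall()
print("duckdb:", time.perf_counter() - start)
```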
Transform data lake to data lakehouse using Apache Iceberg | Real time ETL | Kafka | Data Lake
Views: 948 · a month ago
🚀 Exciting News! Today we're transforming the open-source data lake into a data lakehouse! 🌊📊 🌊 Imagine combining the best of data lakes and data warehouses into one powerful, unified system. That's exactly what a data lakehouse offers! 🌟 🔍 Key Benefits: Scalability & Flexibility: Easily manage vast amounts of structured and unstructured data. Cost-Efficiency: Optimize storage costs with tiered d...
Configure & run Elementary data test in dbt | Data Tests | DBT Tests | Data Observability | P2
Views: 291 · a month ago
In this video we are covering 🚀 Elementary data tests. Elementary is the dbt native observability tool! 📊✨ Say goodbye to guesswork and hello to insightful data pipeline observability with Elementary. 🎉 Configure Elementary tests just like dbt tests in the schema yaml file. Elementary tests are executed just like dbt native tests via the "dbt test" command. 💡 With Elementary, you can: 🔍 Monitor DBT runs...
Real time ETL: Integrate Kafka Data Stream with a Data Lake | Kafka | Data Stream | Data Lake
Views: 1.5K · a month ago
🚀 Exciting News! We're covering a powerful integration between Apache Kafka data streams and an open-source data lake! 🌊📊 Imagine harnessing the real-time processing power of Apache Kafka with the scalable storage capabilities of MinIO. This dynamic duo is set to transform how we handle real-time data, enabling seamless, high-performance data ingestion, processing, and storage. 🔹 Real-Time Data In...
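As a minimal Python sketch of the idea (the video's actual pipeline may be wired differently), assuming a local Kafka broker and a MinIO instance with its S3 API exposed, with topic, bucket, and credentials below as placeholders:

```python
# Minimal consumer sketch: read JSON events from a Kafka topic and land each one
# as an object in MinIO's S3-compatible storage. All names/credentials are placeholders.
import json
import uuid

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sales_events",                         # placeholder topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO endpoint
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

for message in consumer:
    # Write each event into the raw zone of the data lake.
    key = f"raw/sales_events/{uuid.uuid4()}.json"
    s3.put_object(Bucket="datalake", Key=key, Body=json.dumps(message.value))
```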
How to integrate Pandas AI with local LLM using Ollama? | Private & Free | Ollama | AI | llama3
Views: 749 · 2 months ago
Ever wished your data analysis could be smarter, faster, and more intuitive? Meet Pandas AI, the revolutionary tool that merges the power of pandas with the intelligence of generative AI. Whether you're a data scientist, analyst, or just a data enthusiast, Pandas AI is here to transform the way you work with data. 🔍 Key Features: Intelligent Data Handling: Seamlessly integrates with pandas to e...
Exciting new data observability tool for dbt | dbt native | Monitor dbt runs | data quality | P1
Views: 497 · 2 months ago
This week we are covering an 🚀 exciting new tool, Elementary, your go-to dbt native observability tool! 📊✨ Say goodbye to guesswork and hello to insightful data pipeline observability with Elementary. 🎉 Built exclusively for the dbt community, Elementary provides unparalleled visibility into your data pipelines, empowering you to optimize, troubleshoot, and accelerate your analytics workflows li...
How to add dbt seed to dbt project & load reference data | dbt seeds | load data to Datawarehouse
Views: 383 · 2 months ago
In this video we are covering dbt seeds. dbt provides the seed function to import CSV data. By default dbt looks for CSV files in the seeds directory. Using the seed command we can load CSV files to our data warehouse with a simple command. dbt makes this very easy. The CSV files are located in our dbt repository, so they are version-controlled and code-reviewable. Seeds are best suited to static ...
ETL Incremental Data Load Approach Using DLT | Source Change Detection | Load New & Change Data
Views: 880 · 2 months ago
In this video we use the data load tool (dlt) library. We will explore date-based incremental data loads using dlt. Previously we covered the auto-incrementing id incremental load approach. Here we use the source change detection technique. With this approach we detect which rows have changed and only pull the changed data for the ETL operation. This is an optimal method of moving between syst...
How to perform ETL Incremental Data Load using DLT | Data Load Tool | ETL | Python
Views: 1.2K · 3 months ago
In this video we continue with the data load tool (dlt) library. We will explore how to perform an incremental data load using dlt. An incremental data load in ETL (Extract, Transform and Load) is the act of loading only new or changed data. With this approach we process minimal data, use fewer resources, and therefore take less time. dlt refers to this as the merge write disposition. We keep the ...
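A hedged sketch of what merge-style incremental loading looks like with dlt (table, column, and connection details are placeholders, not the exact code from the video):

```python
# Incremental load sketch: dlt tracks the last seen cursor value and the merge
# write disposition upserts rows by primary key. Names and credentials are placeholders.
import dlt
import sqlalchemy as sa

engine = sa.create_engine("postgresql://etl:password@localhost:5432/source_db")

@dlt.resource(primary_key="order_id", write_disposition="merge")
def orders(modified=dlt.sources.incremental("modified_date", initial_value="2024-01-01")):
    # Only pull rows changed since the last successful run.
    query = sa.text("SELECT * FROM orders WHERE modified_date > :since")
    with engine.connect() as conn:
        for row in conn.execute(query, {"since": modified.last_value}):
            yield dict(row._mapping)

pipeline = dlt.pipeline(
    pipeline_name="incremental_orders",
    destination="postgres",
    dataset_name="staging",
)
print(pipeline.run(orders()))
```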
How to connect to Postgres using Python | Query SQL Database | Pandas | Postgres
Views: 1.1K · 3 months ago
In this video we are covering how to connect to Postgres databases using Python. This is a common question that comes up in the ETL series, so I decided to cover it and direct viewers to it if they are facing connectivity issues. There are a few prerequisites for this. We need the Postgres database installed and configured. Link: ua-cam.com/video/fjYiWXHI7Mo/v-deo.html We use a Jupyter notebook ...
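The core of it fits in a few lines, assuming pandas, SQLAlchemy, and psycopg2 are installed (the database name, credentials, and table below are placeholders):

```python
# Minimal sketch: open a SQLAlchemy engine and pull a query result into a DataFrame.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials -- swap in your own host, user, password, and database.
engine = create_engine("postgresql+psycopg2://etl:password@localhost:5432/AdventureWorks")

df = pd.read_sql("SELECT * FROM dimproduct LIMIT 5", engine)
print(df.head())
```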
data load tool (dlt) build database data pipeline | verified source | data pipeline | etl | Python
Views: 1.2K · 3 months ago
In this video we cover the data load tool (dlt) verified source module. A verified source is a Python module that allows creating pipelines that extract data from a particular source. We add a SQL database source to our dlt project. This helps us write clean data pipelines using dlt's built-in modules. dlt is a Python based ETL (ELT) tool. If you need to catch up on dlt then here is the link to the ini...
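In practice the verified-source pattern looks roughly like this, assuming you scaffolded the source with `dlt init sql_database postgres` (the table names and dataset below are placeholders):

```python
# Sketch of using the scaffolded sql_database verified source. Connection details
# are read from .dlt/secrets.toml; table names below are placeholders.
import dlt
from sql_database import sql_database  # module generated by `dlt init sql_database postgres`

# Pick only the tables we need from the source database.
source = sql_database().with_resources("dimproduct", "factinternetsales")

pipeline = dlt.pipeline(
    pipeline_name="sql_db_pipeline",
    destination="postgres",
    dataset_name="staging",
)
print(pipeline.run(source))
```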
how to build data pipelines with data load tool (dlt) | data pipeline | etl | Python
Views: 2.2K · 4 months ago
In this video we are covering an exciting new Python library. We have covered the data build tool, commonly known as dbt. It is a data transformation Python library. It decoupled the Extract, Transform and Load (ETL) process and covers the T in ETL. That left us wanting for the EL process. Now we have the data load tool library in Python. This, as the name suggests, does the Extract and Loa...
How to integrate Great Expectations Data Quality tests in Airflow? | Data pipeline | Data Quality
Views: 1.2K · 4 months ago
In this video, we will cover how to integrate Great Expectations Data Quality tests in Apache Airflow. In this session, we will use the Great Expectations (GE) provider for Airflow and run the Great Expectations suite. Our data asset will be a PostgreSQL table. In this tutorial, we will see how to test an ETL pipeline with Great Expectations using Python. It is essential to test the quality of data...
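Wiring the provider into a DAG looks roughly like this, assuming the airflow-provider-great-expectations package is installed and you already have a GE project with a checkpoint (the paths and checkpoint name are placeholders):

```python
# Sketch of running a Great Expectations checkpoint as an Airflow task.
# data_context_root_dir and checkpoint_name are placeholders for your GE project.
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG(
    dag_id="etl_with_ge_checks",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate_staging = GreatExpectationsOperator(
        task_id="validate_staging_table",
        data_context_root_dir="/opt/airflow/great_expectations",
        checkpoint_name="staging_products_checkpoint",
    )
```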
Build custom private chatgpt that produces SQL | Custom LLM | Open Source LLM | Create Custom Model
Views: 513 · 4 months ago
Build custom private chatgpt that produces SQL | Custom LLM | Open Source LLM | Create Custom Model
How to run LLM Locally? | Integrate LLM in your APP | Build with LLM | Ollama | Streamlit
Views: 1.1K · 5 months ago
How to run LLM Locally? | Integrate LLM in your APP | Build with LLM | Ollama | Streamlit
How to create a Great Expectations suite? Quality Checks for Data Pipelines | Data Quality
Views: 1.9K · 5 months ago
How to create a Great Expectations suite? Quality Checks for Data Pipelines | Data Quality
How to navigate the channel and find content on this channel? | Channel's Website |
Views: 1.6K · 5 months ago
How to navigate the channel and find content on this channel? | Channel's Website |
Polars a multi-threaded lightning fast Python Library | The next Big Python Data Science Library
Views: 539 · 6 months ago
Polars a multi-threaded lightning fast Python Library | The next Big Python Data Science Library
Data Lakehouse workflow Apache Iceberg and Nessie | How Iceberg works | Nessie Branch & Merge
Views: 1.1K · 7 months ago
Data Lakehouse workflow Apache Iceberg and Nessie | How Iceberg works | Nessie Branch & Merge
Create on premise Data Lakehouse with Apache Iceberg | Nessie | MinIO | Lakehouse
Views: 4.1K · 8 months ago
Create on premise Data Lakehouse with Apache Iceberg | Nessie | MinIO | Lakehouse
Orchestrate Airbyte & dbt with Dagster | Orchestrate Modern Data Stack with Dagster | Airbyte | dbt
Views: 3.4K · 9 months ago
Orchestrate Airbyte & dbt with Dagster | Orchestrate Modern Data Stack with Dagster | Airbyte | dbt
dbt Power User extension for VS Code | Accelerate your dbt development like Pros | dbt
Views: 3.9K · 9 months ago
dbt Power User extension for VS Code | Accelerate your dbt development like Pros | dbt
Kafka Real-Time data analysis with Streamlit | Kafka | Data Streaming | Clickhouse | Real-Time
Views: 2.8K · 10 months ago
Kafka Real-Time data analysis with Streamlit | Kafka | Data Streaming | Clickhouse | Real-Time
Set up Clickhouse database for Kafka Streaming | Data Streaming | OLAP database | Clickhouse
Views: 3.6K · 10 months ago
Set up Clickhouse database for Kafka Streaming | Data Streaming | OLAP database | Clickhouse
Dagster Orchestrate Jupyter Notebook | Jupyter Notebook | Schedule Notebooks with Dagster
Views: 1.6K · 10 months ago
Dagster Orchestrate Jupyter Notebook | Jupyter Notebook | Schedule Notebooks with Dagster
How to install Dagster on Docker? | Build Custom Dagster Docker Image | Dagster | Docker
Views: 2.9K · 11 months ago
How to install Dagster on Docker? | Build Custom Dagster Docker Image | Dagster | Docker
dbt build Star Schema using dimensional modeling | data modeling with dbt | build dims & fact | P4
Views: 3.9K · 11 months ago
dbt build Star Schema using dimensional modeling | data modeling with dbt | build dims & fact | P4
Orchestrate SQL Data Pipelines with Airflow | Schedule SQL scripts with Airflow | ETL with SQL
Views: 2.5K · a year ago
Orchestrate SQL Data Pipelines with Airflow | Schedule SQL scripts with Airflow | ETL with SQL

COMMENTS

  • @Pattypatpat7122
    @Pattypatpat7122 · 1 day ago

    This was great, much easier on my Windows machine than my Linux machine for a change. Just a question, your table definitions in the video for AdventureWorks don't appear to be the same as the available ones on the Microsoft site for versions 2019 or 2022. I created some dummy tables based on the same table definitions in your GitHub, but obviously my dummy data doesn't relate, so I can't properly test if this model is properly generating the correct SQL. Do you have a link to the database you were using?

  • @aniketrele7688
    @aniketrele7688 · 3 days ago

    Hi, is the connector name and topic name always the same? Can you name your topic something else? It would be helpful to have multiple topics for one connector. Thanks in advance.

    • @BiInsightsInc
      @BiInsightsInc · 2 days ago

      Hi there, no, your connector name can be different from your topic name. You can have multiple connectors read from the same topic.

  • @junaidmalik660
    @junaidmalik660 · 5 days ago

    Thanks a lot for the detailed video. I want to ask about the accuracy of the results: is it accurate for big datasets?

    • @BiInsightsInc
      @BiInsightsInc · 2 days ago

      The results are good on various data sizes. However, you should be careful with the data size. PandasAI uses a generative AI model to understand and interpret natural language queries. The model has a token limit and if your data exceeds that limit then it won’t be able to process your request.

  • @tiagovianez8482
    @tiagovianez8482 · 8 days ago

    Teacher, where is the source of this data? I would like to insert it into my database. In my case I will insert it into PostgreSQL, run the ETL and write it to S3. Could you provide me with the source?

    • @BiInsightsInc
      @BiInsightsInc · 7 days ago

      Hi, the data source is an MS SQL Server sample database called AdventureWorks. You can download and restore it. I have a tutorial on how to install SQL Server and restore this database. Here is the link: ua-cam.com/video/e5mvoKuV3xs/v-deo.html

  • @krishnarajuyoutube
    @krishnarajuyoutube · 12 days ago

    Can we run Llama 3 locally on any simple VPS server, or do we need GPUs?

    • @BiInsightsInc
      @BiInsightsInc · 12 days ago

      Hi, you'd need a GPU to run an LLM. By the way, VPS servers can have GPUs.

  • @diwaspoudel7
    @diwaspoudel7 · 13 days ago

    Hi there, do you have a Docker YAML file containing the MSSQL connection?

    • @BiInsightsInc
      @BiInsightsInc · 12 days ago

      Yes, I have done a video on it where I install additional SQL Server providers and connect to SQL Server. Here is the link: ua-cam.com/video/t4h4vsULwFE/v-deo.html&lc=UgxQFElBNgK2dwKo5kV4AaABAg

  • @mohdmuqtadar8538
    @mohdmuqtadar8538 · 17 days ago

    Great video! What if the response from the database exhausts the context window of the model?

    • @BiInsightsInc
      @BiInsightsInc · 16 days ago

      Thanks. If you are hitting the model's maximum context length then you can try the following. 1. Choose a different LLM that supports a larger context window. 2. Brute force: chunk the document and extract content from each chunk. 3. RAG: chunk the document and only extract content from a subset of chunks that look “relevant”. Here is an example of these from LangChain: js.langchain.com/v0.1/docs/use_cases/extraction/how_to/handle_long_text/

  • @GordonShamway1984
    @GordonShamway1984 · 17 days ago

    Wonderful as always and just in time. I was going to build a similar use case next week that auto-generates database docs for business users. This comes in handy🎉 Thank you again and again.

    • @BiInsightsInc
      @BiInsightsInc · 17 days ago

      Glad it was helpful! Happy coding.

  • @KevinHa-wg8qv
    @KevinHa-wg8qv · 18 days ago

    Hi. I encountered this error when trying to add the Debezium connector via an API call. Would you please help? Thanks. Failed testing connection for jdbc:postgresql://localhost:5432/AdventureWorks with user 'etl': Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections. [io.debezium.connector.postgresql.PostgresConnector]

  • @ryanschraeder8681
    @ryanschraeder8681 · 24 days ago

    What happens if you kill the airflow web server, or localhost? Will the DAG still run on the schedule you specified?

    • @BiInsightsInc
      @BiInsightsInc · 24 days ago

      If the services are down then the DAG won’t run. You want to make sure your server remains on for the DAG to execute on schedule.

  • @gustavoleo
    @gustavoleo · 25 days ago

    Namaste Haq!!! Thank you so much for making this video and also sharing your repo! I'm a bit confused about how you build the connection string. Would you mind sharing it? I had checked your Connect to SQL Server with Python notebook also, but didn't realize what's not correct in my ConnectionStringCredentials()!

    • @BiInsightsInc
      @BiInsightsInc · 25 days ago

      Thanks. The connection string is defined in the secrets.toml file. I have covered it in the initial videos. You can watch them here: ua-cam.com/video/y9ooIJ7qibU/v-deo.html ua-cam.com/video/niml1EsMy9o/v-deo.html

  • @dltHub
    @dltHub · 26 days ago

    ❤ Thank you for this amazing video!

  • @rafaelg8238
    @rafaelg8238 · 27 days ago

    Great video 👏🏻

  • @cvarak3
    @cvarak3 · a month ago

    Hi, would you suggest this method to extract data from an active Postgres table that has ~5 billion rows? If not, do you have any videos on what method you would suggest to extract from Postgres to S3? Thanks! (Tried with Airbyte but it keeps failing.)

    • @BiInsightsInc
      @BiInsightsInc · 29 days ago

      Hi, if you have a Kafka cluster running then you can stream data from Postgres to Kafka. A cluster can handle large datasets. You can stand up your own or utilize Confluent Cloud. Once this setup is in place, configure an S3 sink connector. I have covered that in the following video: ua-cam.com/video/j_dEUpV9sCo/v-deo.html

  • @danielvoss2483
    @danielvoss2483 · a month ago

    Great job, keep going 👍

  • @coolkillakhan
    @coolkillakhan · a month ago

    i love you

  • @dmunagala
    @dmunagala · a month ago

    def test_Genre_dtype_str(df): assert (df["Genre"].dtype == str or df["Genre"].dtype == 'O') This test case always returns Pass.

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      If the data type of this column is string or object then it will pass. If you have a datatype of int or float then it will fail. You can also remove the "O" and test for string if that's the objective. Here is an example of this test with int: github.com/hnawaz007/pythondataanalysis/blob/main/ETL%20Pipeline/Pytest/Session%20one/string%20and%20object%20test%20result.png

    • @dmunagala
      @dmunagala · a month ago

      @BiInsightsInc Thanks for responding. When I have the column value as 1, which is int, the below assertion is passing. I tried to remove "O" and then it's failing, but it fails even if the data type is string. assert (df["Genre"].dtype == str or df["Genre"].dtype == 'O')

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      @dmunagala You need to check the data type. The value might be 1 but it can be stored as a string. Check my previous comment; I have a link to this test and it’s failing with the int data type.

    • @dmunagala
      @dmunagala · a month ago

      @BiInsightsInc Yes, you are right. I checked the datatype by using df.info() and got to know the exact datatypes for all columns in my csv file. It is working as expected. Thank you so much for your help, you are amazing!!

  • @nitintharwani
    @nitintharwani · a month ago

    Thanks for sharing this. I was trying to set this whole thing up locally. I am able to run queries using the Trino CLI inside Docker, but I am not able to connect DBeaver to Trino. Can you share the DBeaver connection configuration?

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      Thanks. I have covered the Trino and DBeaver configuration in the data lake video. Here is the link: ua-cam.com/video/DLRiUs1EvhM/v-deo.html

  • @streambased
    @streambased · a month ago

    Love how Kafka is turning into a datalake now that you can have unlimited retention at very cheap cost (KIP-405). This means that you can reduce data movement by bringing analysts directly where the data was ingested. This opens a plethora of new data sources and much greater volume of data available for ad hoc analysis. We can finally say bye to the cost, complexity and consistency issues associated with heavy ELT/ETL processes!

  • @vladimirborisov3586
    @vladimirborisov3586 · a month ago

    Hi! Thank you for this great video! I'm still not very clear about why we need Hive if we already use Iceberg. With Iceberg we can also get a schema from the MinIO raw data directly, and we create the same table twice, in Hive and in Iceberg, but we will use Iceberg most of the time. Did I miss something?

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      Thanks. In my experience Iceberg cannot directly read data from S3 using the Trino connector. We create an Iceberg table and insert data into it. However, there are exceptions for tools that have native integration with Iceberg. For example, we either need a REST catalog or Spark to create/populate an Iceberg table from S3 directly. Let me know if you have seen an S3 example with Trino without the Hive Metastore.

  • @rafaelg8238
    @rafaelg8238 · a month ago

    Great video 👏🏻

  • @priyamuthusamy6735
    @priyamuthusamy6735 · a month ago

    Thanks for the video. I installed jupyter_scheduler with pip, but the create notebook job option is not listed. Could you please help?

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      Here is the Jupyter Scheduler official install guide. Please check the requirements section to make sure your environment meets all the requirements: jupyter-scheduler.readthedocs.io/en/latest/users/index.html#installation

  • @dangelov3402
    @dangelov3402 · a month ago

    Help me please, I can't download the output files when I click the download icon. Nothing happens at all 😢

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      Can you provide a little more context? What are you trying to download and from where?

    • @dangelov3402
      @dangelov3402 · a month ago

      I installed JupyterHub using Docker on Windows Subsystem for Linux and installed the Jupyter Scheduler extension in the Docker container of my account. It worked, but I can't download the output files when I click the icon. There is no error message, and nothing happens.

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      @dangelov3402 You should be able to access your Windows system under the /mnt directory. Also, you can use the VS Code Remote Explorer extension to browse the Docker container's file system and copy and/or paste files between Windows and the container. Here is an example of the mnt directory: superuser.com/questions/1324069/how-to-copy-a-file-from-windows-subsystem-for-linux-to-windows-drive-c

    • @dangelov3402
      @dangelov3402 · a month ago

      I found a solution: this is a bug in version 2.7.1. I installed version 2.5.2 and there is no problem. Thank you so much.

  • @ahmetaslan9261
    @ahmetaslan9261 · a month ago

    Thanks for the informative content! So how would you deal with large tables, especially for the initial loads? Assume the table is 200-300 GB; selecting all the data and keeping it in data frames / in-memory objects doesn't look practical, so I believe defining a batch key/partitions on the source side and iterating over it in code could be a way.

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      Thanks. It seems you are working with a large dataset. You can partition your data and also use chunksize to limit the amount of data read into memory, if your source supports it. Keep in mind Pandas is single-threaded, so it will be slow processing a dataset this size. However, there are specialized tools for large datasets, e.g. Flink, Polars and Spark. Flink and Spark are designed for this and run on a cluster, therefore they are not restricted by a single machine's limits, so they are worth exploring for large datasets. I have covered PySpark on the channel if you are interested: ua-cam.com/play/PLaz3Ms051BAkwR7d9voHsflTRmumfkGVW.html Here are the links to these frameworks: flink.apache.org/ spark.apache.org/docs/latest/api/python/index.html

    • @ahmetaslan9261
      @ahmetaslan9261 · a month ago

      @BiInsightsInc Thank you so much for your time and prompt response. I'll definitely go through your posts related to PySpark.

  • @kamanlimbu
    @kamanlimbu · a month ago

    This may be helpful to anyone who is facing the issue. I got it myself, so I'm excited to share. If you are getting the error "Data load error: (psycopg2.errors.InsufficientPrivilege) permission denied for schema public LINE 2: CREATE TABLE "stg_DimProduct" (", first install psycopg2-binary (pip install psycopg2-binary) and then I suggest giving permission to the etl user. The creator of the video has provided the GitHub repo, so he has actually provided the script. But for your convenience you can run this code in your PostgreSQL query tool in pgAdmin: "CREATE SCHEMA IF NOT EXISTS etl AUTHORIZATION postgres; GRANT ALL ON SCHEMA etl TO etl; GRANT ALL ON SCHEMA etl TO postgres;" and then run the code; the data will be loaded/imported to the PostgreSQL database.

  • @alaab82
    @alaab82 · a month ago

    One of the best tutorials on YouTube, thank you so much!

  • @fj578
    @fj578 · a month ago

    Very informative

  • @gobindamohan
    @gobindamohan · a month ago

    Thanks for this awesome man

  • @MohacelHosen-cd8iz
    @MohacelHosen-cd8iz · a month ago

    i subscribe , send my phone to Bangladesh, Uttara, Sector#13, Road#18. I prefer macbook

  • @patrickblankcassol4354
    @patrickblankcassol4354 · a month ago

    Thank you for the video, excellent explanation.

  • @saadlechhb3702
    @saadlechhb3702 · a month ago

    Thank you, I have a question: if the task is scheduled to run daily and new data has been inserted into the source since the last transfer, will just the new data get transferred on the next run, or all of the data again?

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      This would bring in all the data. This is a truncate and load approach. If you need to bring in only newly inserted data then you want to look into the incremental data load approach(es). I have covered those in the following videos: ua-cam.com/video/a_T8xRaCO60/v-deo.html ua-cam.com/video/32ErvH_m_no/v-deo.html

    • @saadlechhb3702
      @saadlechhb3702 · a month ago

      @BiInsightsInc Can I make a script that combines the incremental load with Airflow?

  • @fernandomaximoferreira1067
    @fernandomaximoferreira1067 · a month ago

    Awesome tutorial.

  • @kashifrana6798
    @kashifrana6798 · a month ago

    Great explanation. Which library is great for styling? XlsxWriter or openpyxl?

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      Choosing between openpyxl and XlsxWriter boils down to the specific requirements of your project. If your work involves a lot of interaction with existing Excel files, especially those requiring advanced features, openpyxl is probably right for you. Conversely, if your focus is on generating new, large, well-formatted Excel reports from scratch, XlsxWriter is your best bet.

    • @kashifrana6798
      @kashifrana6798 · a month ago

      @BiInsightsInc Thank you!

  • @girmamoges941
    @girmamoges941 · a month ago

    I heard that in order to access an on-premises MS SQL database, the server must be accessible through the internet. Is that always true? As per your example you have mentioned on-premises; if that is the case, what are the settings in Airflow to connect to MS SQL Server? What needs to be installed, or what should be in the Dockerfile, packages, or requirements file in the project folder?

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      You have to be careful when exposing a database to the internet. This can result in a whole heap of issues. Anyway, if you need to connect to your SQL Server over the internet, the safest and preferred method is to set up a VPN connection between the SQL Server and the clients that need to access it via the internet, or whitelist the IP addresses of the client(s) that will be accessing it. If you are doing it for testing purposes then exposing the SQL Server port via port forwarding will do the job.

  • @rolandrooseveltagodzo5974
    @rolandrooseveltagodzo5974 · a month ago

    Thanks for making it so simple... the best for beginners I have seen so far.

  • @saadlechhb3702
    @saadlechhb3702 · a month ago

    Hello, when I run docker-compose up, I get "no configuration file provided: not found", and when I tried to copy another YAML file from a different source on GitHub to my folder I get "invalid spec: workspace:: empty section between colons", and I don't know how to solve the problem.

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      You want to make sure you have Docker and Docker Compose installed. Also, make sure you are in the right directory.

  • @rafaelg8238
    @rafaelg8238 · a month ago

    Great project, congrats. Keep it going!

  • @vladimirborisov3586
    @vladimirborisov3586 · a month ago

    Hi! Thank you for the video, it's a great explanation as always on your channel! I have a question. I have a similar task starting from Kafka, and now I'm using the Iceberg/Dremio/Nessie stack from your previous video for storing the data. Here you have added Hive; could you explain the benefits of using Hive with, or instead of, the stack from your previous data lakehouse guide? Thanks!

    • @BiInsightsInc
      @BiInsightsInc · a month ago

      Thanks. The Nessie catalog is feature-rich, has Git integration, and I personally like it over Hive. However, Dremio is cloud native and offers an open-source option, so I'd read the fine print on what's allowed commercially. That's the only catch. Otherwise, your setup is optimal for streaming and storage. This implementation is fully open source and you can deploy it commercially. Both options offer similar capabilities, and I will cover more as the current connector limits Iceberg's ACID capabilities. More to come on that.

  • @amymorrison5615
    @amymorrison5615 · a month ago

    🔥🎆😍

  • @BiInsightsInc
    @BiInsightsInc · a month ago

    Link to Kafka series: ua-cam.com/play/PLaz3Ms051BAkwR7d9voHsflTRmumfkGVW.html
    Link to Data Lake video: ua-cam.com/video/DLRiUs1EvhM/v-deo.html
    Link to data lake GitHub repo: github.com/hnawaz007/pythondataanalysis/tree/main/data-lake
    Link to Kafka GitHub repo: github.com/hnawaz007/pythondataanalysis/tree/main/kafka

  • @peezhead
    @peezhead · a month ago

    This is what happened after I did pip install virtualenv:
    Collecting virtualenv
      Downloading virtualenv-20.26.2-py3-none-any.whl.metadata (4.4 kB)
    Collecting distlib<1,>=0.3.7 (from virtualenv)
      Downloading distlib-0.3.8-py2.py3-none-any.whl.metadata (5.1 kB)
    Collecting filelock<4,>=3.12.2 (from virtualenv)
      Downloading filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB)
    Collecting platformdirs<5,>=3.9.1 (from virtualenv)
      Downloading platformdirs-4.2.2-py3-none-any.whl.metadata (11 kB)
    Downloading virtualenv-20.26.2-py3-none-any.whl (3.9 MB)
       ---------------------------------------- 3.9/3.9 MB 8.9 MB/s eta 0:00:00
    Downloading distlib-0.3.8-py2.py3-none-any.whl (468 kB)
       ---------------------------------------- 468.9/468.9 kB 2.7 MB/s eta 0:00:00
    Downloading filelock-3.14.0-py3-none-any.whl (12 kB)
    Downloading platformdirs-4.2.2-py3-none-any.whl (18 kB)
    Installing collected packages: distlib, platformdirs, filelock, virtualenv
    Successfully installed distlib-0.3.8 filelock-3.14.0 platformdirs-4.2.2 virtualenv-20.26.2

  • @BiInsightsInc
    @BiInsightsInc · a month ago

    Link to Ollama setup video: ua-cam.com/video/CE9umy2NlhE/v-deo.html
    Link to complete AI series: hnawaz007.github.io/ai.html

  • @davidadejumo5057
    @davidadejumo5057 · 2 months ago

    Please, how do I access the file?

    • @BiInsightsInc
      @BiInsightsInc · 2 months ago

      Which file are you referring to? You create a sample project and dbt creates the project and the related files. If you want to review or download the dbt project then here is the link to GitHub: github.com/hnawaz007/dbt-dw

  • @alexiojunior7867
    @alexiojunior7867 · 2 months ago

    Wow, these videos make dbt a very easy tool to use.

  • @voxdiary
    @voxdiary · 2 months ago

    Why not put part numbers as part of the title?

    • @BiInsightsInc
      @BiInsightsInc · 2 months ago

      Good suggestion. I will add the P# to the title. Here is the link to the whole series: hnawaz007.github.io/mds.html

  • @amymorrison5615
    @amymorrison5615 · 2 months ago

    👏👏👏👏