OGIF office hours #37 - AI fine-tuning, data archiving, and datalake scaling with Notion

Topics and highlights

Session setup & check-in: Kicked off with a casual vibe, confirming no all-hands this week and setting up three talks. Phát skipped his slot, and the team troubleshooted screen sharing for demos.
AI fine-tuning overview: Explored fine-tuning vs. retraining, using a doctor’s note example to show how fine-tuning embeds knowledge while retraining leans on token-heavy prompts.
Fine-tuning demo: Showcased a Duty 40 Mini fine-tuning job on Open AI (~4800 tokens), comparing pre- and post-tuning results, with a nod to local vs. hosted model trade-offs.
Data archiving essentials: Biên broke down archiving vs. backup for apps with 50K-1M daily transactions, focusing on metadata, cloud storage, and recovery to optimize query performance.
Archiving tools & Q&A: Highlighted tools like AWS, Google Cloud, and Timescale, plus hot/warm/cold storage options (Azure, Backblaze), with audience questions on scheduling and platform quirks.
Datalake foundations: An traced datalakes from 1980s databases to today’s cloud systems, contrasting warehouse ETL (structured) with datalake ELT (raw data) workflows.
Notion’s datalake scaling: Detailed Notion’s growth to 96 instances and 400+ shards by 2023, shifting from warehouse to datalake with Debezium CDC, Kafka, Hudi, and S3 for analytics.
Interactive wrap-up: Fielded questions on datalake vs. replication, async processing, and external data handling (e.g., social media), ending with reflections on big data skillset.

Vietnamese transcript

[00:00] Hình như tuần sau mới có all-hand. Không thấy ai tạo event gì hết, chắc tuần này cứ bình thường thôi nhá. Anh em kiểm tra xem màn hình sharing có vấn đề gì không. Hay lên luôn nhỉ? Hôm nay chắc có ba bài thôi đâu đó. Phát vừa bảo tuần này cậu không có gì mới, chắc skip hôm nay rồi. Anh em thử share màn hình cá nhân xem sao nào. Xem trước được không?

[10:42] Theo lịch chắc anh nhỉ, để em lên trước nhá. Fine-tuning này, chủ đề này không mới lắm đâu. Bài này chỉ là 100.5 thôi, không phải 101, nên chỉ giới thiệu sơ sơ, chưa đi sâu được đâu. Tại em cũng mù mờ lắm, nên chắc giới thiệu sơ vậy thôi. Hôm nay em giới thiệu bài fine-tuning. Đây là agenda của bài này nè.

[12:24] Introduction là nếu mọi người dùng AI, chắc có nghe tới khái niệm fine-tuning rồi. AI có mấy mô hình đa số được fit vào dữ liệu từ một ngày nào đó, với mấy cái data privacy hoặc data của domain riêng. Dữ liệu này không xuất hiện trong knowledge của mô hình nền tảng (foundation model). Để mô hình có được kiến thức đó, người ta thường dùng retraining, đúng không? Nhưng còn một cách khác gọi là fine-tuning. Cuối bài, em sẽ so sánh hai cách này, xem lúc nào nên dùng cái nào, lúc nào không.

[13:08] Trước mắt, cứ hiểu fine-tuning như cách để mở rộng kiến thức cho mô hình nền tảng vậy. Fine-tuning là gì? Hiểu đơn giản là mọi người retrain lại mô hình, lấy một mô hình nền, đưa vào một dataset gì đó để fine-tune, nghĩa là retrain lại nó. Sau khi fine-tune xong, ta được một mô hình đã điều chỉnh, gọi là fine-tuned model.

[13:44] Tại sao fine-tuning mang được kiến thức mới? Hiểu đơn giản là trong một AI model, kiến thức được lưu qua mấy cái mạng nơ-ron. Fine-tuning sẽ cập nhật các weight, các thông số của mạng nơ-ron đó, để nó phù hợp với kiến thức mới. Khi ném kiến thức mới vào, mấy cái weight thay đổi, lúc này mô hình đã được cập nhật kiến thức rồi.

[14:19] Khi ném kiến thức mới vào, mấy cái weight thay đổi, lúc này mô hình đã được cập nhật kiến thức rồi. Khi fine-tune mô hình trong thực tế, không phải chỉ ném dataset vào rồi retrain là xong. Đúng là ra một mô hình fine-tuned, nhưng không biết cái mô hình sau retrain này có tốt hay không. Em sẽ giới thiệu một workflow mà bên ngoài thường dùng để fine-tune.

[14:59] Cái flow này, nó gồm nhiều bước như thế này. Mọi người có thể chia thành hai cụm, hai cụm nhá. Cái flow này, em sẽ chia thành hai cụm. Em đi qua cụm một trước. Cụm đầu tiên là cụm ở bên trái, hiểu đơn giản là một base model ban đầu. Sau đó, mọi người có một dataset mới, một cái gì đó mới, mọi người quăng vô, fine-tune nó. Rồi nó ra một model, mọi người sẽ supervise nó, có nghĩa là mọi người retrain nó dưới kiểu là retrain nó.

[15:38] Sẽ cho nó thêm kiến thức, với input này thì output sẽ ra như này. Nó sẽ ra một cái gọi là supervised fine-tuning. Sau đó, để em đi qua phần tiếp. OK, cái fine-tuning này là retrain on data, nghĩa là sẽ cho một cặp input-output trong dataset để nó học. Nó sẽ học được những kiến thức mới đó. Để bước này hoàn hảo, dataset phải được clean. Nó phải clean, không được lẫn với những cái khác. Nghĩa là nó phải specific cho cái domain mà mình muốn train nó.

[16:31] Quay lại hình này, sau khi mọi người có một cái model đã được retrain xong, mọi người mang lên production, mọi người dùng, đúng không? Lúc này, bên ngoài sẽ sử dụng một cái system gọi là human feedback. Kiểu như là response của model này có làm bạn hài lòng không, chấm từ 1 tới N sao, kiểu vậy á. Mọi người sẽ collect cái data đó. Nó nằm ở bước này, mọi người sẽ thu thập human feedback từ cái retrained model của mọi người.

[17:06] Dựa vào cái feedback đó, mọi người gọi bước này hơi tốn tài nguyên chút. Mọi người sẽ phải retrain một cái model riêng. Cái model này dùng để đánh giá xem response này được chấm bao nhiêu điểm. Kế đó, mọi người tới bước thứ ba, bước cuối. Ở bước này, mọi người sẽ dùng thuật toán như reinforcement learning để kết hợp với cái retrained model và cái reward model của mọi người.

[17:46] Mọi người retrain, mọi người lại fine-tune cái model một lần nữa. Nó sẽ ra cho mọi người một cái gọi là model tối ưu. Cái vòng lặp này cứ tiếp tục, tiếp tục mãi. Mọi người có cái model đã retrain xong, thu thập human feedback, rồi kết hợp ba cái đó để retrain cái model thêm lần nữa. Càng ngày, cái model sẽ càng ok hơn với những gì mà mình muốn. Đó là cái flow mà em thấy bên ngoài, trong production, người ta hay dùng.

[18:24] Trong quá trình fine-tuning, ,ọi người sẽ thường nghe tới khái niệm gọi là catastrophic forgetting. Nghĩa là sao? Nghĩa là khi mọi người retrain kiến thức mới vào, nó sẽ làm giảm performance với những kiến thức cũ. Tại sao chuyện này xảy ra? Như em đã nói, kiến thức của một model dựa vào mấy cái weight, dựa vào kiến trúc của cái model đó và những tham số của nó. Tham số dynamic trong một model là mấy cái weight. Khi mọi người retrain, mấy cái weight này thay đổi, đúng không?

[19:00] Khi nó thay đổi, có phải là kiến thức cũ sẽ bị giảm bớt độ chính xác đi không? Nếu trong dataset của mọi người có nhiều dữ liệu bị overfitting, nghĩa là dataset của mọi người quá đúng, quá đúng trong cái dataset đó. Khi một người quăng cái gì mới vào, nó sẽ sai với những cái cũ đi. Người ta gọi đó là overfitting, nghĩa là nó bị fold in quá mức vào những cái training data. Khi gặp data mới, nó sẽ giảm performance.

[19:42] Nên lúc này, bên ngoài người ta sử dụng một kỹ thuật gọi là parameter-efficient fine-tuning, gọi là PEFT. Nó có nhiều cách, nhiều kỹ thuật trong method này, như LoRA này kia. Nhưng trung quy, đa số bọn họ không phải update hết tất cả các weight trong cái model đó. Bọn họ sẽ chỉ đóng băng những layer nào không cần thiết. Họ sẽ đóng băng mấy cái layer không cần thiết, rồi chỉ update một số lượng nhất định các weight thôi. Để tránh trường hợp kiến thức cũ bị mất đi quá nhiều. Đó là cái cơ bản. Còn sâu hơn về mấy cái algorithm đằng sau, mọi người có thể tự tìm hiểu.

[20:27] Quay lại câu hỏi lúc ban đầu, ha, nó với retraining khác nhau thế nào, nên dùng cái nào? Có cái bảng đây, mọi người có thể dễ dàng nhận ra. Retraining là dữ liệu phụ thuộc vào database của mọi người. Cứ quăng vào, quăng vào, lúc nào data cũng được update liên tục. Còn fine-tuning là mọi người retrain lại model, nên lúc nào data cũng chỉ ở cái chỗ mà mọi người đã retrain thôi.

[21:16] Kế tiếp là customize and learning style. Nghĩa là cái retraining, mục đích của nó là cho mình một cái knowledge base để mình lấy mấy cái knowledge base đó ra tham chiếu, sử dụng. Còn fine-tuning thì sao? Nó upgrade cái não của model lên, để nó có sẵn cái knowledge đó luôn. Còn mấy cái ở dưới thì chắc mọi người tự tìm hiểu tiếp ha.

[21:57] Em có một cái ví dụ như vậy. Ví dụ như là mọi người muốn làm một cái system để giải thích những cái note của bác sĩ, đúng không? Những cái note của bác sĩ, mọi người có thể biết là những cái note của bác sĩ nó có rất là nhiều từ chuyên ngành. Và những từ chuyên ngành đó nó còn viết tắt, viết kiểu luộm thuộm nữa.

[22:44] Nếu mọi người sử dụng fine-tuning á, mọi người sẽ cho nó học hết tất cả những cái kiến thức luộm thuộm, những cái shorthand, những cái handwriting đó của bác sĩ. Nên khi mọi người input một cái note của bác sĩ vô, nó sẽ trả lời được rất đúng. Còn nếu mọi người dùng retraining á, khi mọi người input một cái note của bác sĩ vô, nó sẽ kiếm được những cái relevant data, mang ra đọc. Nhưng bản chất là cái model nó không hiểu được những từ đó, nên nó cũng sẽ không đưa cho mọi người một câu trả lời chính xác.

[23:16] Mọi người có thể hiểu như này: fine-tuning là mình nhờ một bác sĩ đọc một cái note của bác sĩ. Còn retraining là mọi người đưa cho một người có kiến thức rất rộng đọc một cái note của bác sĩ. Người đó có thể kiến thức rất rộng, nhưng về mấy cái chuyên ngành, mấy cái chuyên ngành thật sự, thì nó không đủ sâu như của một bác sĩ thực thụ. Nên độ chính xác sẽ không cao.

[23:57] Thứ hai, mọi người có thể nói, bây giờ với retraining, mình dùng một cái system prompt để list hết mấy cái shorthand của bác sĩ ra trong system prompt, nó sẽ tự hiểu thôi. Nhưng làm vậy, mọi người sẽ bị tốn token, đúng không? Tại vì khi mọi người dùng retraining, mọi người lấy hết cái retraining data ra, quăng một cái knowledge retraining vô, lại cộng thêm đống cái zero-shot, mấy cái description, mấy cái đi kèm theo nó trong một cái prompt á, thì nó rất tốn token.

[24:34] Và khi ở trong một cái long conversation với một cái model, nó đâu phải chỉ dựa vào câu hỏi của mình đâu. Nó sẽ dựa vào tất cả các cuộc trò chuyện từ trước tới giờ của mình mà nó trả lời cho mình. Lúc này, nó sẽ dẫn tới trường hợp là nó bị limit bởi token. Đó là cái drawback khi sử dụng retraining, là nó sẽ tốn token. Tại vì mọi người cần token để chạy cái system prompt của mọi người nữa. Còn fine-tuning, bản chất là model nó đã có kiến thức rồi, nên không cần phải có system prompt.

[25:14] Đó là sơ qua về fine-tuning. Chắc có cái demo cho mọi người xem sẽ rõ hơn ha. Bây giờ em sẽ fine-tune một cái model là Duty 40 Mini ha. Em có một cái dataset như này. Ừ, như này thì mỗi thứ nó sẽ có một cái system như retraining, rồi user hỏi cái này thì muốn nó trả lời vậy, đúng không? Em cộng 10 cái, 10 record trong cái dataset này, em sẽ fine-tune nó.

[26:13] Trước khi fine-tune, em sẽ cho nó chạy qua một đoạn code để em estimate được. Tại vì em dùng Open AI, nên sẽ tốn tiền. Nên mình sẽ tính được estimate là nó sẽ charge mình bao nhiêu. Em dùng xong, khúc cuối nó sẽ kiểu, tầm khoảng 4800, sắp xỉ 4800 token. Cái này chỉ là tham khảo thôi, nhưng em thấy nó cũng đúng. Sau đó, em sẽ upload cái file data này lên Open AI. Nó sẽ cho em cái file ở trên cái Open AI của em.

[27:03] Rồi em sẽ training nó. Em sẽ tạo một cái fine-tuning job. Lúc này, ở trên Open AI, nó sẽ chạy một cái job này. Mọi người có thể lên đây, mọi người đọc, mọi người quan sát. Nó sẽ không trả kết quả liền, nó sẽ tạo một cái job để pending ra đó, để trên Open AI nó fine-tune cho mình. Trong lúc chờ, mình có thể theo dõi quá trình của nó như thế nào. Sau khi xong đâu rồi, nó sẽ thông báo cho mình. Mình cứ stamp cái câu này, cứ check cái câu này để coi nó đã hoàn thành hay chưa. Mình đọc ở cái chỗ đó.

[27:58] Sau khi xong, nó sẽ cho mình mấy cái result status. Sau khi fine-tune xong, với cùng một câu hỏi, ví dụ đây là cái câu hỏi em sử dụng, em dùng câu hỏi này. Cái câu hỏi này gần giống với một cái record trong đống dataset của em. Sau khi em chạy, nó sẽ trả lời như vậy. Nhưng trước khi fine-tune, em dùng một cái model bình thường nha, model bình thường thì nó sẽ trả lời kiểu vậy.

[28:49] Có nghĩa là em fine-tune thì nó đã thành công. Đó là cái cách em sử dụng Open AI để fine-tune một cái model. Demo của em tới đây thôi. Anh em có câu hỏi gì không? Đúng rồi, cái này demo em xài tuning chứ để tự fine-tune bằng local mà xịn xịn thì chắc không đủ đồ. Dạ, đồ ngon nhõ thôi. Thực ra có mấy cái model trước, tô nó trên LoRA các thứ, cũng có thể demo được. Nhưng tô không, bài này easy, bài này kiểu một...

[29:47] Lẻ tẻ trầm mấy á. Đúng, chắc cũng ổn mà. Nói chung, những đội enterprise hay không muốn tốn thời gian xây dựng GPU thì sẽ dùng cách này. Diagram GPT hồi trước, GPT-4o Mini ra thì fine-tuning đã miễn phí, dùng cái này cũng tiện lợi cho họ. Cái cửa hàng demo cho anh em là sử dụng một cái như kiểu service ấy. Open AI cung cấp service fine-tuning, đưa lên mấy cái model của nó luôn. Mình pick mấy cái model, chắc là pick model mini á. Chắc chi phí nó không cao lắm.

[30:37] Đấy cũng là một cách. Nhưng vấn đề thực ra là mình vẫn không phải người own cái model đấy. Bản chất là vẫn host ở trên server của họ. Còn có một cách khác là tự build server và tự running. Trường hợp hôm nay đã khác. Anh em xem, hôm qua em có thử một cái model có 3 billion parameter thôi. Nhưng nó chạy hai ba tiếng, nó chưa xong đâu anh. Thực ra bài này, cái version nó đơn giản hơn một cái bây giờ, nhỏ hơn của bài trước. đầu.

[31:26] Nó là một cái full flow liên quan đến gì ta, reinforcement feedback. Ý là cái em giới thiệu ở bên ngoài production á, là khi người ta tuning á. Người ta không phải chỉ fine-tune xong là dùng liền, người ta phải đánh giá lại coi nó có đúng không. Người ta phải cho nó vô cái cycle để càng cải tiến cái fine-tuned model nữa, kiểu vậy. Đây là một cái flow như vậy. Bản chất nó cũng model thôi, đâu có gì đâu. Quan trọng là mọi người biết được những cái cost để đánh giá cái approach thôi.

[32:05] Tại ra nó vẫn là bài toán accuracy, đúng không? Mình chọn cách nào để làm cái output nó chính xác hơn. Những cái method như retraining hay fine-tuning, nó sẽ có những nhược điểm khác nhau. Và thực ra kể cả fine-tuning, nó cũng có nhiều method fine-tuning khác nhau. Chắc là cần đi sâu hơn để xác định mấy cái đó. Cái này vẫn hơi general. Chắc vậy, Hoàng. Nếu có điều kiện thì chắc đi sâu hơn tí nữa.

[32:48] Sâu hơn theo kiểu là có mấy cái method liên quan đến phần retraining các thứ. Sử dụng fine-tuning method á, có một số cái method nó tương đối tiết kiệm về mặt tài nguyên. Tất nhiên, nó sẽ đánh đổi với một số thứ khác, kiểu vậy. Giới thiệu cái đó để anh em xem thử đâu đó. Mọi người hỏi, Đạt hỏi là khi nào cần fine-tuning. Nói là fine-tuning cần khi mà mọi người muốn nó có một cái kiến thức, một cái specific topic nào đó. Mọi người có thể cân nhắc sử dụng fine-tuning.

[33:39] Nhưng trong tất cả trường hợp, em thấy bên ngoài, đa số mọi người sẽ prefer dùng retraining. Tại vì nó dễ và tốn ít tài nguyên hơn. Nhưng một số trường hợp như lúc này, cái ví dụ em nói về cái note của bác sĩ á, suppose là nên dùng tuning. Rồi tùy cái kiến trúc, tùy cái mình chia system của mình ra nhiều system nhỏ, system nhỏ nó như thế nào nữa, tùy. Có thể có một vài cái use case như kiểu chúng nó muốn host mấy cái model bé bé, model bé chẳng hạn.

[34:18] Chỉ kiểu dành để làm một cái task cụ thể thôi. Ví dụ như phân tích thời tiết, độ ẩm các thứ để perform cái action nào đấy. Ví dụ như thay đổi cái theme của điện thoại hay để chỉ action nào đấy chẳng hạn. Có thể retrain cái model bé bé để chỉ cần làm chuyện đó thôi, không cần phải cần network các thứ gì cả. Chắc vậy. Từ giờ chắc là Biên ha?

[36:13] Dụng cái và build cái recovery process cho nó. Chi tiết như nào thì nó sẽ có một vài phần chính. Trước tiên là cái lý do mà mình cần cái kỹ thuật này và so sánh nó với một cái quen thuộc hơn là backup. Sau đó là đi vào việc để mình build và những cái mình cần để ý, những cái gì. Đầu tiên là trên thực tế, thường có những tổ chức, những công ty mà chạy những cái app với lưu lượng dữ liệu cao á. Ví dụ như giao dịch chứng khoán này nọ. Như em ví dụ này là kiểu 50.000 transaction.

[37:12] Như em ví dụ này là kiểu 50.000 mỗi ngày là ít á, kiểu vọt lên 500.000, triệu transaction mỗi ngày. Sau khoảng thời gian, cái lượng data nó sẽ phồng lên rất rất lớn, ảnh hưởng đến cái việc mà mình query data và ảnh hưởng đến cái trải nghiệm người dùng. Trong những cái data đó, sẽ có những cái data mà dùng rồi thì nó rất ít được access lại nữa. Ví dụ như lịch sử trên 7 năm trước chẳng hạn. Nó sẽ dẫn đến một cái vấn đề, làm sao để mình giải quyết cái đống data đó. Nên mình mới dùng cái kỹ thuật là data archiving.

[38:14] Nó sẽ có những cái lợi ích để counter lại những chuyện bên trên. Đầu tiên là cái data mà mình sử dụng, mà nó set liên tục á, query ghi đọc liên tục á, thì nó thường tốn chi phí cao. Mình sẽ dùng cái kỹ thuật này, mình sẽ đem data của mình bỏ qua một cái chỗ khác, chi phí rẻ hơn, access ít hơn. Từ đó, nó sẽ làm tăng được cái performance của app của mình trong việc query hay aggregate data các thứ.

[39:07] Về mặt pháp lý hay reusable, những cái data đó nó sẽ được bảo vệ an toàn, không bị ảnh hưởng bởi những yếu tố bên ngoài. Để sau này khi mình dùng lại, mình có thể lấy ra dùng được. Như mọi người hay nói, mọi người sẽ liên tưởng đến cái data backup, thường dùng trong việc restore data, restore cái system hay app nếu có lỗi xảy ra. Mà hai thằng này, nó sẽ khác nhau ở chỗ là data backup á, nó sẽ dùng cho cái việc hotfix cái system nhiều hơn. Còn cái thằng archiving...

[39:59] Data archiving thì nó hướng về cái việc lưu trữ data một cách lâu dài. Nó sẽ có cái chi tiết so sánh như này. Để mình đi build một cái architecture, một cái system để archive data, xong rồi dùng nó để recovery lúc mình cần thì sẽ làm như sau. Mọi người thấy, nó sẽ có ba cái note chính. Thứ nhất là mình lưu data lại, mình dùng metadata để interact với những cái data đó, rồi mình bỏ lên một chỗ, ví dụ như những cái cloud-based service, cloud storage service, để mình lưu trữ cái data đó.

[40:51] Về chi tiết, để lưu trữ cái dữ liệu á, đầu tiên mình phải xác định những cái dữ liệu cần được lưu trữ. Phải phân tích xem dữ liệu nào hay được sử dụng, dữ liệu nào không được sử dụng, ít được truy cập. Sẽ có nhiều công cụ để mình làm những cái đó. Ví dụ như phân tích từ business requirement, hoặc từ các công cụ phân tích, mấy cái công cụ phân tích á. Từ đó, mình mới biết cái data nào là cần, cái nào có thể đem đi archive lại.

[42:05] Sau đó, mình sẽ gói nó lại, dùng một vài biện pháp như vector hóa nó, encode nó, rồi dùng checksum các thứ để đảm bảo cái data nó sẽ đúng. Sau này, khi mình sử dụng lại, mình truy cập lại một cách nhanh chóng. Tại vì những cái database này, nó gói lại ở một cái storage khác với cái mình hay set, nên mình cần phải lưu lại cái metadata của nó. Ví dụ như lưu theo tháng ha, hoặc lưu theo account, để sau mình query lại thì dễ hơn.

[43:07] Sau khi archive xong, mình muốn sử dụng lại á, thì vừa đây là cái ví dụ em để recovery. Mình sẽ tận dụng những cái metadata lúc nãy, mình search lại những cái block data mà mình cần, rồi đưa về cái môi trường tính toán lại nó khi cần thiết. Cái này nó có lợi ích là khi mình làm những chuyện này, nó sẽ không tác động đến cái data production của cái ứng dụng đang chạy. Mình có thể làm song song được. Mình muốn làm gì với nó thì làm, không chọc ngoáy vào trong cái production, sẽ đảm bảo an toàn được.

[43:51] Cho cái trải nghiệm người dùng, như sản phẩm của mình đó. Nói đến đây, có một vài cái practice cho việc sử dụng, xây dựng cái hệ thống này. Nó cũng đơn giản lắm nhỉ. Mình sẽ phải review lại những cái policy mà mình đặt ra để cái hệ thống này chạy, xem data nó có trọn vẹn hay không. Mình sẽ automation những cái step của cái process này. Hiện tại cũng có nhiều tool hỗ trợ mình rồi, ví dụ như AWS, hay Google, đều có những cái như...

[45:14] Google Cloud chẳng hạn. Mình chỉ cần viết những cái đơn giản để đẩy lên trên đó thôi. Và mình không thể thiếu cái monitoring để xem data này có hoạt động tốt hay không. Xong rồi, có những kỹ thuật khác như checksum này nọ, để đảm bảo data của mình luôn trọn vẹn. Khi mình cần, cũng sẽ có những chiến lược như schedule trước cái data. Tại vì những cái data này nó tồn tại lâu, nó cũng sẽ lớn, cũng sẽ phồng lên trên cái storage, cái cloud storage mà mình dùng để lưu trữ nó.

[46:05] Nên sẽ có những chiến lược như khi nào cần thì phải schedule trước, bao nhiêu thời gian đó để nó replicate data cho mình chẳng hạn. Kế của em chỉ như vậy thôi. Lý thuyết kiểu để giải quyết cái mục đích cuối cùng là nói mọi người về việc giải quyết những cái data tồn động lâu dài, nhưng không sử dụng đến nhiều trong cái hệ thống mà mình build thôi. Ví dụ như bên ngân hàng chẳng hạn, sẽ có kiểu user trade, trade của user nó lên đến cả trăm triệu record chẳng hạn.

[46:45] Sau này, nó sẽ lên nữa. Tức là query những cái data gần thôi, nhưng nó cũng rất tốn thời gian, kiểu vậy. Đó là những cái mà em nói hôm nay, hết. Mọi người có hỏi gì không? Khi mà store data, zip data, là mình sẽ zip một cái đoạn fragment trong quá khứ mà nó không sử dụng data đấy cho mục đích hiện tại, đúng không? Dạ, đúng rồi, đúng rồi. Đồng ý, việc em sẽ phải xóa. Khi xong, em phải xóa cái đó, đúng rồi. Nên mình sẽ có những cái load lại để tính toán khi cần.

[47:41] Nên mình mới có mấy cái kiểu để mình làm nó an toàn. Mình có hỏi kìa. Em chưa biết cái cơm của Thỏ có biết cái này không, so sánh được không? Đứng ra là Timescale, nó có cơ chế move chunk. Ví dụ là mình compression như bình thường thôi. Thêm về cái vụ là mình có hot, warm, và cold storage. Ví dụ mình backup hàng tuần thì để trên hot storage của Azure. Nếu là cũ quá, ví dụ 2, 3, 4, 5 năm, thì để trên cold storage của Azure, hoặc là Backblaze. Nó sẽ có riêng cái dịch vụ cho mình move cái chất data đó.

[48:53] Đúng cái vị trí object hoặc block storage, mình tương tác với Timescale để đảm bảo lúc mình cần tiết kiệm tiền với data cũ. Có thể tiết kiệm được, vốn có thể query, với trade-off là mình sẽ query hơi chậm với data hơi cũ thôi. Dạ, cái em hiểu là để tùy vào cái platform mình dùng để build ha anh. Ví dụ như bên Microsoft thì cũng sẽ có những cái tùy vào thời gian của database, hoặc tùy vào tuổi thọ của data, hay dung lượng này nọ, thì sẽ có những cái level khác nhau.

[49:40] Ví dụ nó sẽ có delay, hay bình thường vẫn access, hay delay cho những cái mà không dùng một thời gian lâu nữa. Cái đó là để mình cụ thể trên từng tool thôi. Còn chung chung, nó là anh đang cái này làm gì thì đứng ra là Timescale thì phù hợp cho cái kiểu pattern này, cho về time series. Bên phía Azure thì họ làm cho nó phù hợp với status, hơi giống như Timescale, nhưng nó kiểu giúp mình partition và shard đúng theo kiểu mình mong muốn.

[50:39] Mỗi một cái nó sẽ có ưu điểm, nhược điểm riêng. Với AWS thì đứng ra là với cái dịch vụ này thì phải coi chừng cái hardware cho lưu cái data này, nó có ổn định không. Ví dụ bên phía Azure cold storage thì nó dùng đĩa, đĩa gì ta, đĩa hơi khá đặc trưng. Phải dùng cái máy laser để in vào trong đó. Nên query rất nhanh, nhưng insert thì cũng hơi chậm, kiểu insert một đống cũng mất vài phút. Vì phải có một cái laser cứng để in ở trên đó, không có virtualization layer.

[51:23] Mỗi một service và mỗi một cái kiểu tool mình dùng cho compress và lưu trữ sẽ có ưu điểm, nhược điểm riêng, theo cái platform mình subscribe. Dạ, đúng rồi. Cái này không chỉ là mấy cái tool kiểu như AWS hay Google service. Nó là kiểu mình cũng có thể cân nhắc cho cái business của mình nữa. Nên cái này kiểu chung chung thôi. Còn từng platform, nó sẽ dùng những kỹ thuật khác nhau. Mục đích chung cuối cùng là để giải quyết cái vấn đề data nó lớn lên, nhưng ảnh hưởng đến cái việc mình query, mình nó chạy thôi

[52:16] Nhiều cách giải quyết cho câu chuyện optimize query, đúng không? Khi mà vấn đề là do data quá lớn, thì có một vài cách. Cách của biên là một cách, tức là sẽ có một phần data mình đang không xài đến, thì ta cắt đi ra, lưu đâu đấy. Về sau mà có cần đến past data thì insert lại xài sau. Còn mình để đâu đó tầm bao nhiêu phần trăm data hiện tại, đủ để xài mục đích hiện tại, query đi nó nhanh hơn.

[52:59] Còn một số cách khác thì xài thằng tooling, có một số kiểu database hay kiểu như Timescale, thì nó sẽ optimize luôn cho chuyện query với lượng data lớn lớn. Em nghĩ là bên dưới thì nó cũng sẽ tự động kiểu nó buff lên đâu đó, nó giữ giúp mình thôi, đúng không anh? Nên mình tỉ mỉ bên dưới, mình dùng là interface thôi. Cái bên dưới thì gần gần như nhau, như các em ta. Cảm ơn biên, vậy thôi. Chắc bài cuối của An, không biết có liên quan không. Không biết còn liên quan một tí gì đến cái cộng đồng viên không.

[54:00] Chắc có thì chắc cũng nói sơ sơ thôi, cũng không nhiều cái. Cũng gần giống như bài của Biên, nhưng use case cũng gần giống á. Nó mở rộng ra tí thôi. Tí rồi thì bài này là nói sơ về cái datalake với lại cái use case của thằng Notion. Mình nói cái datalake trước. Datalake thì chắc mọi người nghe miết rồi, xưa giờ cũng hơi lâu rồi đó. Mình nhìn lại cái quá trình phát triển của tụi datalake này, coi là mình đang đi tới đâu.

[54:54] Thật ra từ lúc bắt đầu, hồi tầm 1980 gì đó, là thời đại của thằng database, mấy thằng database warehouse, mấy cái mà mình đang xài hiện tại á. Về table các kiểu, tạo table rồi xử lý data. Sau này, tới cái đợt tầm năm 2000 các kiểu, tụi mấy thằng big tech bắt đầu thu thập data nhiều á. Rồi nó tận dụng mấy data đó, thì mới sinh ra mấy thằng để giải quyết vấn đề lưu trữ data và xử lý data trên dữ liệu lớn. Như là mấy dữ liệu lưu theo dạng file đồ á. Mấy cái này, mấy thuật ngữ như là cái MapReduce này nè.

[55:44] Hình như trong cái memo của mình có một bài về MapReduce. Nếu mọi người không biết thì có thể search lại, tìm đọc thử xem cái MapReduce hồi xưa nó làm cái gì. Nó là cái tiền thân của tuổi. Sau này nó tích hợp vô thôi, giờ không xài nữa, nhưng chắc là nó tích hợp sẵn hết rồi. Sau cái thời gian phát triển của thằng này, mới bắt đầu 2010, thì mới đẻ ra, trước 2010 tí, đẻ ra khái niệm về datalake, big data, cloud, là cái internal data warehouse á, trên cloud á. Nó cloud thôi.

[56:28] Sau này, cái đợt bây giờ á, thì nó bắt đầu phát triển hơn nữa, là về cái lake và datamart. Lake chắc bản chất là kết hợp giữa mấy cái của tụi datalake và cái warehouse thôi, để rồi đặt thành cái house. Như là mấy thằng như thằng Datadog, nó đang làm sao không biết, nhưng mình chắc là đang nói về cái này hơi đi sau thời đại tí. Để tập trung vào, chắc mình coi sơ một cái data architecture chung chung trước. Cái này, bữa cái bài của Tom có đăng, cũng có một cái diagram. Nó cũng tinh gọn hơn cái này, tinh gọn hơn tí, là cũng về cái data đi qua mấy cái layer, là processing rồi mới tới thằng gì đó.

[57:20] Cái này nó sẽ thể hiện rõ hơn tí, là trong một cái datalake, mình sẽ lưu những loại data gì. So với thằng data warehouse, mình chỉ lưu mấy thằng structured data thôi, hoặc là mấy cái như lưu table data clean hết rồi. Còn thằng datalake này thì nó raw data, nó sẽ cả structured, unstructured, semi-structured data luôn. Nó sẽ lưu dạng raw, sau đó nó mới xử lý data, transform data, rồi nó quăng qua cho cái đám bên BI analytics, hoặc là quăng vô cái warehouse khác để chứa cái data đã được process rồi á.

[58:18] Còn cái layer mà analytics sandbox này, thì nó là một cái layer để cho tụi data scientist, hoặc mấy thằng mà cần dùng cái raw data, process data, mà nó không ảnh hưởng tới cái process chính. Bên đây á, thì nó sẽ làm việc trên cái sandbox này để xử lý data cho tụi mấy thằng đó, mấy thằng cần raw data, nhưng không ảnh hưởng trực tiếp tới cái ruồng chính. Cái giống như hồi nãy Biên có nói á, có làm á đó, là nó sẽ lấy data, rồi nó lưu ở đâu đó để sử dụng sau này, hoặc để process gì đó không biết, nhưng mà nó không muốn ảnh hưởng tới process chính của cái app, thì nó sẽ là cái đống này.

[59:09] Ở chỗ này, mọi người thấy là mình có khái niệm là cái ETL á, là extract, transform, và load. Bên cái warehouse xưa giờ mình làm á, nó sẽ là extract, transform, và load, nó đi theo thứ tự đó luôn. Nhưng trong cái này, mình sẽ thấy rõ là cái thằng datalake á, nó sẽ là extract và load trước. Rồi sau khi nào cần á, nó bắt đầu process data, là transform. Transform sẽ đứng sau, load sẽ đứng trước. Đó là cái khác biệt giữa hai thằng.

[59:52] Đây là chỗ so sánh khác biệt giữa thằng data warehouse và datalake thôi. Đó là dữ liệu bên warehouse, nó được clean, structured, organized thành cái table. Còn thằng này thì nó lưu dạng file, raw data các thứ, semi-structured rồi đó, CSV hoặc mấy cái JSON. Cái process nó cũng sẽ khác nhau giữa thằng lake và lake này. Truy vấn thì thằng warehouse sẽ truy vấn bằng SQL, còn kia thì xử lý trực tiếp trên cái dữ liệu luôn. Mấy thằng hỗ trợ xử lý trực tiếp dữ liệu, như thằng Spark đó, thì nó sẽ hỗ trợ mấy cái đó. Nói qua về cái thằng Notion.

[01:00:46] Datalake thì cái use case của thằng Notion, mọi người biết là Notion mình xài cũng hơi nhiều rồi đó. Hồi xưa, nó cũng đi từ từ thôi. Mấy cái tổ chức, mấy cái block hồi xưa, nó tổ chức thì cũng kiểu data bình thường, giống như mình, là mấy cái app nhỏ nhỏ. Mấy cái block của nó bắt đầu tăng dần. Block của nó được hiểu là mấy cái gì, rồi nó sẽ bao gồm cái title trong trong đó. Nó sẽ gọi là block. Số lượng block của nó tăng lên liên tục theo ngày giờ.

[01:01:35] Gì đó thì bắt đầu sau này, nó phình ra, nó sẽ bắt đầu sử dụng mấy cái kỹ thuật như là sharding, sharding xưa. Như nhớ có bài của Hải Vũ có xe gì đó, nó scale horizontally. Nó bắt đầu tách ra sharding này nọ, rồi mấy cái instance. Trong giai đoạn từ 2021 đến 2023, nó sẽ có 32 instance. Mỗi instance sẽ có 15 cái shard. Rồi từ 2023 trở đi á, nó bắt đầu chia lại, nó lại tăng lên. Số lượng tăng lên nữa, thì đó là 96 cái instance. Và mỗi cái instance, nó sẽ là 5 cái shard. Nhân lên tầm 400 mấy á, bốn trăm mấy.

[01:02:27] Để mà xử lý thì lúc này, nó hơi to, đúng không? Khi mà data nó bắt đầu to lên á, nó sẽ có những nhu cầu. Sau này sẽ có những nhu cầu về cái analytics, hoặc là mấy cái về làm bên machine learning á, tập dữ liệu này nọ, mẹo mẹo rồi. Nó sẽ bắt đầu setup một cái data warehouse architecture của nó. Cái này là cái tiền thân trước khi setup cái datalake. Nó sẽ làm data warehouse để xử lý data. Cái luồng cơ bản của nó setup để thu thập data, mấy cái về thay đổi data của mấy cái block trong từng shard.

[01:03:21] Nó sẽ sử dụng cái file transfer để nó ingest mấy cái data từ mấy shard này nè. Nó đổ về cái gì, rồi nó gộp mấy thằng đó lại thành một cái single database to. Cái này sẽ gặp khó khăn trong việc là nãy mình nói, nó đang có khoảng bốn trăm mấy cái shard, đúng không? Nó sẽ gặp khó khăn trong việc là quản lý bốn trăm mấy connection thằng này. Xong rồi mấy cái khó khăn trong việc scaling. Số lượng data thay đổi trong mỗi cái block của thằng Notion, nó xảy ra thường xuyên và nó rất nặng, sẽ...

[01:04:13] Gây khó khăn trong việc đọc ghi trong cái table to này. Sau đó, nó mới bắt đầu setup một cái internal datalake của nó. Cái internal datalake này, có note là nó sẽ không thay thế thằng này hoàn toàn, mà nó chỉ sử dụng cái mới thôi. Còn cái này, nó vẫn tận dụng trong một vài tác vụ, kiểu nhẹ hơn, cho mấy cái table thay đổi data không có nặng lắm. Với lại nó cần cái gì. Còn thằng này, nó expect cái luồng này á, là nó sẽ đánh những cái data nó cần để cho những mục đích mà analytics hoặc là machine learning.

[01:05:08] Data nó có thể chấp nhận cái độ trễ là vài tiếng, vài phút, tiếng gì đó. Nó sẽ sử dụng cái data trong đây. Cái lượng setup thì cũng đơn giản thôi. Nó sẽ sử dụng cái thằng Debezium CDC này nè. Nó là cái capture data change á, để nó watch cái thằng database này, bắn về Kafka. Sau khi nó bắn cái đống event data change về Kafka, thì có một thằng bên đây là Hudi hay gì đó, nó lấy event đó, nó quăng về thằng S3. Rồi bắt đầu từ thằng này, thằng nào muốn sử dụng thì vô đây, nó lấy về, nó setup tiếp, xài data warehouse hoặc xài mấy cái chủ đích về shard gì đó, thì vô đây nó lấy, nó xài.

[01:05:51] Cái đó là cái thật ra, cái case Notion. Chắc là có thể xài thằng này thử. Vì nó cũng là cái thằng đứng ở ngoài, nó watch vô cái đống đó. Nếu mà xài AWS hay retraining á, sẽ xài cái một là cái thằng Redshift hay gì quên rồi. Nó sẽ watch thằng đó, những thay đổi trên cái database, xong rồi nó sẽ lưu hết về trong một cái bucket hay cái gì đó. Xong rồi từ đó, mình bắt đầu xử lý sau. Cái luồng bên này là có thể sử dụng cái này. Hồi nãy setup một cái demo, nhưng mà có vẻ hơi fail rồi.

[01:06:51] Tại vì nó chưa có được cái thằng server, nên là nó fail. Để sau đi, rồi chắc chỉ có như đó. Với lại có cái kiểu góc nhìn đó, là cái process này nè. Là cái process mà chắc tụi enterprise, nó sẽ có thể áp dụng. Nó là process kiểu chung chung mà đa số tụi enterprise sau này, em nghĩ là nó có thể. Nhu cầu của nó khi mà cái data lớn lên á, thì cái nhu cầu của nó cũng sẽ đi theo hướng này thôi. Đó là nó cần data, thu thập data để làm cái gì đó, và không ảnh hưởng tới cái luồng chính.

[01:07:52] Mình thì xưa giờ toàn focus vào cái việc làm việc với mấy cái model AI. Nhưng mà mình nghĩ là sau này, mình cũng cần cái skill set gì đó để mình biết cách xử lý những data như thế này, tụi mà nó data lớn hơn kiểu vậy. Cho xin lỗi cái, anh nào đây? Anh đang nhìn nhận cái process này, thì nó khác gì với chuyện là mình replicate cái database của mình ra một instance khác để phục vụ chuyện retraining ấy anh? Là tại vì ở đây, đứng ra là ý ở đây, thật ra là kiểu em đang sinh giống như kiểu sinh data sang một...

[01:08:54] Cái shard khác, đúng không? Data warehouse, đúng không? Và sử dụng upload kit process cho những cái tác vụ mà nó không, kiểu mình làm mình làm async được ấy, chứ không cần phải trực tiếp trên nguồn data chính. Câu hỏi là đối với cả mấy cái model dạng như sharding hay sử dụng master-slave ấy, thì sao không theo kiểu cứ duplicate cái database của mình ra thôi? Duplicate data thì nó vẫn chỉ là một cái data warehouse ở dưới dạng table ha. Còn thật ra cái này, nó chỉ là cái process, nghĩa là một process cho database thôi

[01:09:40]

. Nó có thể có những cái event khác. Như là ví dụ, mình sẽ có nhiều cái external data, không hẳn là mình chỉ có một cái battery, database không. Ví dụ mình có mấy cái capture như là từ social media, hoặc là mấy cái tụm lum la nào đó, chả biết. Nhưng mà nó có thể là nhiều loại data khác nhau, gom về, quăng qua thằng này. Thằng Hudi bạn này, nó sẽ là thằng chịu trách nhiệm xử lý cái raw data đó, để nó quăng vào cái thằng S3 này. Nó lưu...

[01:10:23]

Mọi thứ dưới cái đống này. Nó vô luôn, mọi thứ về dạng file gì đó, gom hết vô đây, để bắt đầu sau này, mấy thằng ngoài sao n mới có cái slot để xử lý. Thật ra tụi nó cũng có một câu hỏi là tại sao không dùng mấy thằng database như MySQL hay PostgreSQL á? Nó sẽ có mấy cái... Tại sao phải sử dụng cái thằng capture data change mà không sử dụng mấy thằng đó? Mấy thằng đó, nó có cơ chế để streaming mấy cái event change của nó luôn. Mấy thằng đó, event stream, nó thường sẽ stream trực tiếp từ database này qua database khác.

[01:11:09]

Còn thằng này, nó sẽ là capture cái event đó và đưa đâu cũng được. Vì là nếu mình không có thằng Kafka này ở đây á, thì mình cần một service nào đó, mình cần cái real-time data xử lý liền luôn á, mình không cần phải vô Kafka. Cái thằng CDC này vẫn có thể bypass qua đó được, kiểu kiểu vậy, chứ không hẳn là từ database sang database kiểu như vậy. Ta cũng có nhìn ý là kiểu, thấy là nếu mà theo góc nhìn về operation chẳng hạn ấy. Tất nhiên nếu mà có multiple datasource và sử dụng những cái partition tool các thứ, nó khác nhau ấy.

[01:11:50]

Database khác nhau, thì cái này cũng sẽ cả, thật ra là gom nó lại vào một cái datalake, sao cho một số cái tác vụ, nó cụ thể thôi ấy. Thực ra là một số team, như kiểu team AI hay team về mặt làm report, hay data, thì người ta cũng sẽ chỉ cần work trên cái data warehouse này thôi, kiểu vậy. Hoặc là có extend cho bên nào khác nữa, thì cũng sẽ make sense, phân vùng data riêng cho từng cái team riêng, đúng không? Có họ thêm cái vụ mà cái button, cái button ETL bên database bình thường với cả bên datalake, thì nó sẽ là ELT, đúng không?

[01:12:39]

Đúng rồi, ELT, anh sẽ hiểu là mình extract, nghĩa là mình tìm đúng file, đúng không? Mình load cái file đấy lên đây, và transform nó thành dạng kiểu structured data ha. Ý là nó sẽ transform, nó chỉ là cái action, nó xảy ra ở sau khi mình có raw data rồi. Còn ETL, nghĩa là extract là sẽ lấy data từ cái đám data source á. Xong rồi nó sẽ có cái quá trình log thẳng vào cái raw data, thẳng vào mấy cái gì đó của mình. Nó gọi là cái raw landing, cái layer raw landing. Xong rồi mình mới có cái gọi là transform.

[01:13:34] Sau đó, sau cái landing sẽ có transform để xử lý data, thì nó sẽ ra sau. Còn cái thằng kia là nó extract xong, rồi transform, nó mới quăng thẳng vào cái warehouse, đó là cái database của mình. Hay anh em có hỏi gì không? Rồi, cảm ơn An, đúng rồi. Anh em nhé, rồi bye anh em, mỗi tuần vui vẻ.

English transcript

[00:00] It seems like the all-hands meeting is next week. I don’t see anyone creating any events, so this week will probably just be normal, right? Guys, check if there’s any issue with screen sharing. Should we just start? Today, I think we’ll have about three presentations. Phát just said he doesn’t have anything new this week, so he’ll probably skip today. Guys, try sharing your personal screens and see how it goes. Can we preview it first?

[10:42] According to the schedule, it’s probably you, right, bro? Let me go first then. This fine-tuning topic, it’s not really that new. This presentation is just 100.5, not 101, so it’s only a brief intro, not going deep into it yet. Honestly, I’m pretty clueless about it too, so I’ll just give a quick overview. Today, I’ll present about fine-tuning. Here’s the agenda for this session.

[12:24] The introduction is, if you guys use AI, you’ve probably heard of the concept of fine-tuning. AI has these models that are mostly fitted to data from some specific day, with stuff like data privacy or data from a particular domain. That data doesn’t show up in the knowledge of the foundation model. To get that knowledge into the model, people usually use retraining, right? But there’s another way called fine-tuning. At the end of this, I’ll compare these two methods, looking at when to use which one and when not to.

[13:08] For now, just think of fine-tuning as a way to expand the knowledge of a foundation model. What is fine-tuning? Simply put, you retrain the model. You take a foundation model, feed it a dataset to fine-tune it, meaning you retrain it. After fine-tuning, you get an adjusted model, called a fine-tuned model.

[13:44] Why does fine-tuning bring in new knowledge? Simply put, in an AI model, knowledge is stored through neural networks. Fine-tuning updates the weights, the parameters of that neural network, to fit the new knowledge. When you throw new knowledge in, those weights change, and at that point, the model has updated its knowledge.

[14:19] When you throw new knowledge in, the weights change, and at that point, the model has updated its knowledge. When fine-tuning a model in practice, it’s not just about throwing a dataset in and retraining it. Sure, you get a fine-tuned model, but you don’t know if that retrained model is any good. I’ll introduce a workflow that people outside commonly use for fine-tuning.

[14:59] This flow, it’s got several steps like this. You can split it into two clusters, two clusters, alright? This flow, I’ll divide it into two clusters. I’ll go through the first cluster first. The first cluster is the one on the left, simply understood as a base model to start with. Then, you have a new dataset, something new, you throw it in, fine-tune it. Then it produces a model, and you supervise it, meaning you retrain it in a way that’s like retraining it.

[15:38] It’ll add more knowledge, with this input, the output will come out like this. It results in something called supervised fine-tuning. After that, let me move to the next part. Alright, this fine-tuning is retraining on data, meaning you give it a pair of input-output in the dataset for it to learn. It’ll pick up that new knowledge. For this step to be perfect, the dataset has to be clean. It has to be clean, not mixed with other stuff. Meaning it has to be specific to the domain we want to train it on.

[16:31]

Back to this diagram, after you have a model that’s been retrained, you bring it to production, you use it, right? At this point, people outside use a system called human feedback. It’s like, does the response from this model satisfy you? Rate it from 1 to N stars, something like that. You’ll collect that data. It’s part of this step—you’ll gather human feedback from that retrained model of yours.

[17:06] Based on that feedback, people call this step a bit resource-intensive. You’ll have to retrain a separate model. That model is used to evaluate how many points this response gets. Next, you move to the third step, the final one. In this step, you’ll use algorithms like reinforcement learning to combine it with the retrained model and your reward model.

[17:46] You retrain, you fine-tune the model one more time. It’ll give you something called an optimized model. This loop keeps going, going forever. You have a retrained model, collect human feedback, then combine those three things to retrain the model again. The more you do it, the better the model gets at what we want. That’s the flow I’ve seen people use out there in production.

[18:24] During the fine-tuning process, you’ll often hear about a concept called catastrophic forgetting. What does that mean? It means when you retrain with new knowledge, it reduces performance on the old knowledge. Why does this happen? As I said, a model’s knowledge depends on its weights, its architecture, and its parameters. The dynamic parameters in a model are those weights. When you retrain, those weights change, right?

[19:00] When they change, doesn’t that mean the old knowledge gets less accurate? If your dataset has a lot of overfitting data, meaning your dataset is too perfect, too perfect within that dataset, then when someone throws something new in, it’ll mess up the old stuff. They call that overfitting, meaning it’s folded in too much to the training data. When it encounters new data, performance drops.

[19:42] So at this point, people out there use a technique called parameter-efficient fine-tuning, or PEFT. It has lots of methods, techniques within this approach, like LoRA and stuff. But generally, most of them don’t update all the weights in the model. They’ll just freeze the layers that aren’t necessary. They freeze those unneeded layers and only update a certain number of weights. That’s to avoid losing too much of the old knowledge. That’s the basic idea. For deeper stuff about the algorithms behind it, you can look it up yourselves.

[20:27] Back to the question from the start, how’s it different from retraining, and which should we use? There’s a table here, you can easily see it. Retraining depends on your database. You keep throwing stuff in, throwing stuff in, and the data gets updated constantly. But with fine-tuning, you retrain the model, so the data stays only where you retrained it.

[21:16] Next is customize and learning style. Meaning, retraining’s purpose is to give us a knowledge base that we can pull from to reference and use. But fine-tuning? It upgrades the model’s brain, so it has that knowledge built-in already. The stuff below that, you guys can probably look into it more yourselves, yeah?

[21:57] I’ve got an example like this. For instance, say you want to build a system to explain doctors’ notes, right? Doctors’ notes, as you might know, have tons of technical terms. And those technical terms are often abbreviated, written all sloppy too.

[22:44] If you guys use fine-tuning, you’ll make it learn all that messy knowledge, the shorthand stuff, the handwriting stuff from doctors. So when you input a doctor’s note, it’ll give you a really accurate answer. But if you use retraining, when you input a doctor’s note, it’ll find the relevant data and pull it up to read. But the thing is, the model doesn’t actually understand those terms, so it won’t give you an accurate answer either.

[23:16] You can think of it like this: fine-tuning is like asking a doctor to read a doctor’s note. Retraining is like giving it to someone with really broad knowledge to read a doctor’s note. That person might know a ton, but when it comes to the specialized stuff, the real technical terms, they don’t have the depth of an actual doctor. So the accuracy won’t be high.

[23:57] Second, you might say, alright, with retraining, we can use a system prompt to list out all the doctor’s shorthand in the system prompt, and it’ll figure it out on its own. But doing that, you’ll end up using a lot of tokens, right? Because when you use retraining, you pull out all the retraining data, throw in a retraining knowledge base, plus a bunch of zero-shot stuff, descriptions, and whatever else goes with it in a prompt, that takes up a ton of tokens.

[24:34] And when you’re in a long conversation with a model, it’s not like it only relies on your question. It bases its answers on all the conversations you’ve had with it from the start. At that point, it leads to a situation where it’s limited by tokens. That’s the drawback of using retraining—it eats up tokens. Because you need tokens to run your system prompt too. But with fine-tuning, the thing is, the model already has the knowledge, so you don’t need a system prompt.

[25:14] That’s a quick rundown on fine-tuning. Probably having a demo for you guys to see would make it clearer, yeah? Now I’ll fine-tune a model called Duty 40 Mini. I’ve got a dataset like this. Yup, like this, each thing has a system like retraining, then the user asks this and wants it to answer like that, right? I’ve got 10 things, 10 records in this dataset, and I’ll fine-tune it.

[26:13] Before fine-tuning, I’ll run it through a piece of code so I can estimate it. Since I’m using Open AI, it’ll cost money. So we’ll calculate an estimate of how much it’ll charge me. After I run it, at the end it’s like, around 4800, close to 4800 tokens. This is just a reference, but I think it’s pretty accurate. Then I’ll upload this data file to Open AI. It’ll give me the file up on my Open AI account.

[27:03] Then I’ll train it. I’ll create a fine-tuning job. At this point, on Open AI, it’ll run this job. You guys can go up here, read it, check it out. It won’t give results right away, it’ll create a job and leave it pending there, so Open AI can fine-tune it for us. While waiting, we can track how the process is going. Once it’s done, it’ll notify us. We just keep stamping this sentence, checking this sentence to see if it’s finished or not. We read it from that spot.

[27:58] Once it’s done, it’ll give us some result statuses. After fine-tuning, with the same question. For example, this is the question I used, I used this question, it’s pretty close to one of the records in my dataset pile. After I run it, it’ll answer like this. But before fine-tuning, I used a normal model, just a regular model, and it answered like that.

[28:49] Meaning, when I fine-tuned it, it worked. That’s how I used Open AI to fine-tune a model. That’s it for my demo. Any questions, guys? Yeah, for this demo, I used tuning, but to do fine-tuning locally with something fancy, I probably don’t have the gear. Yup, just small-time gear. Actually, with some earlier models, I ran them on LoRA and stuff, and they could’ve been demoed too. But I didn’t, this one’s easy, it’s like a basic one.

[29:47] A bit scattered and slow, huh? Yeah, it’s probably fine though. Generally, enterprise teams or those who don’t want to spend time building GPUs will use this method. Back with Diagram GPT, when GPT-4o Mini came out, fine-tuning was free, so using this was pretty convenient for them. The demo shop for you guys is using something like a service. Open AI provides a fine-tuning service, putting it right up on their models. We pick some models probably the mini ones, I guess. The cost probably isn’t too high.

[30:37] That’s one way to do it. But the issue is, we’re still not the ones owning that model. The thing is, it’s still hosted on their server. There’s another way, like building your own server and running it yourself. Today’s case was different. Check it out, guys, yesterday I tried a model with just 3 billion parameters. But it ran for two or three hours and still wasn’t done, bro. Actually, this one, its version is simpler than what we have now, smaller than the previous one from earlier.

[31:26] It’s a full flow related to what was it reinforcement feedback. The point is, what I introduced about production out there is when people tune stuff. They don’t just fine-tune it and use it right away, they have to evaluate it again to see if it’s right. They put it into a cycle to keep improving the fine-tuned model, something like that. This is one of those flows. At its core, it’s just a model, nothing special. The key is you guys knowing the costs to judge the approach.

[32:05] Because it’s still about accuracy, right? We pick a method to make the output more accurate. Methods like retraining or fine-tuning they’ve got different downsides. And honestly, even with fine-tuning, there are lots of different fine-tuning methods. We’d probably need to dig deeper to figure those out. This is still kinda general. Probably so, Hoàng. If we’ve got the chance, we should dive a bit deeper.

[32:48] Deeper in the sense of looking at methods related to retraining and stuff. With fine-tuning methods, some of them are pretty resource-efficient. Of course, there’s a trade-off with some other stuff, like that. I’m introducing this so you guys can check it out somewhere. People asked—Đạt asked when we need fine-tuning. I’d say fine-tuning is needed when you want it to have knowledge on a specific topic. You can consider using fine-tuning then.

[33:39] But in all cases, from what I’ve seen out there, most people prefer retraining. Because it’s easier and uses fewer resources. But in some cases, like right now, the example I gave about doctors’ notes, I’d suppose tuning is better. Then it depends on the architecture, how we split our system into smaller systems, what those smaller systems are like it varies. There might be some use cases where they want to host small models, tiny ones maybe.

[34:18] Just for doing a specific task. Like analyzing weather, humidity, stuff like that, to perform some action. For example, changing your phone’s theme or triggering some action or whatever. You could retrain a small model just for that, no need for a network or anything fancy. Probably like that. From now on, it’s Biên’s turn, yeah?

[36:13] Using it and building a recovery process for it. How it works in detail, it’s got a few main parts. First is the reason we need this technique and comparing it to something more familiar like backup. Then it’s about how we build it and the things we need to watch out for, what stuff. To start, in reality, there are often organizations, companies running apps with high data traffic. Like stock trading stuff, for example. My example here is something like 50,000 transactions.

[37:12] My example is like 50,000 a day is low it could shoot up to 500,000, a million transactions a day. After a while, that data volume swells up huge, affecting how we query data and impacting the user experience. In that data, there’s stuff that, once used, barely gets accessed again. Like history from over 7 years ago, for instance. That leads to a problem how do we deal with that pile of data? So we use a technique called data archiving.

[38:14] It’s got benefits to counter those issues up there. First off, the data we use, the stuff that’s constantly being set, queried, read, and written all the time, it usually costs a lot. With this technique, we take our data and move it somewhere else, somewhere cheaper with less access. That way, it boosts our app’s performance when querying or aggregating data and such.

[39:07] In terms of legal stuff or reusability, that data will be kept safe, not affected by external factors. So later, when we need to use it again, we can pull it out and use it. As people often say, you’ll think of data backup, which is usually used to restore data, restore the system, or the app if something goes wrong. But these two things are different in that data backup is more for hotfixing the system. As for archiving data archiving focuses on storing data long-term.

[39:59] It has a detailed comparison like this. To build an architecture, a system to archive data, and then use it for recovery when we need it, here’s how it works. You guys see, it has three main notes. First, we store the data, we use metadata to interact with that data, then we put it somewhere, like cloud-based services or cloud storage services, to keep that data stored.

[40:51] In detail, to store the data, first we have to figure out which data needs to be stored. We need to analyze which data gets used a lot, which doesn’t get used, or gets accessed rarely. There are lots of tools to help us do that. For example, analyzing from business requirements or using analytics tools, those analytics tools. From there, we figure out which data is necessary, which can be archived.

[42:05] After that, we’ll package it up, using a few methods like vectorizing it, encoding it, then using checksums and stuff to make sure the data stays correct. Later, when we use it again, we can access it quickly. Because these databases are packaged in a storage different from what we usually set, we need to save its metadata. For instance, store it by month, yeah, or by account, so it’s easier to query later.

[43:07] After archiving, when we want to use it again, just now I gave an example for recovery. We’ll use that metadata from earlier, search for the data blocks we need, then bring it back to the computing environment when necessary. The benefit here is that when we do this stuff, it doesn’t mess with the production data of the running app. We can do it in parallel. Whatever we want to do with it, we do, without poking around in production, so it keeps things safe.

[43:51] For the user experience, like our product. Speaking of this, there are a few practices for using and building this system. It’s pretty simple, right? We’ll have to review the policies we set up for this system to run, check if the data stays intact. We’ll automate the steps of this process. Nowadays, there are plenty of tools supporting us already, like AWS or Google, they’ve got stuff like...

[45:14] Google Cloud, for example. We just need to write some simple stuff to push it up there. And we can’t skip monitoring to see if this data is working well or not. Then, there are other techniques like checksums and such, to ensure our data always stays intact. When we need it, there’ll also be strategies like scheduling the data beforehand. Because this data sticks around for a long time, it’ll grow big, it’ll swell up in the storage, the cloud storage we use to keep it.

[46:05] So there’ll be strategies like, when we need it, we have to schedule in advance, how much time it’ll take to replicate the data for us, for instance. That’s my plan, that’s it. The theory is kind of to address the ultimate goal of explaining to you guys about handling data that sticks around long-term but isn’t used much in the system we build. Like in banking, for example, there’s stuff like user trades, user trades hitting hundreds of millions of records or something.

[46:45] Later on, it’ll grow even more. Meaning querying just the recent data, but it still takes a ton of time, something like that. That’s what I talked about today, done. Any questions, guys? When we store data, zip data, it’s like we zip up a fragment from the past that doesn’t use that data for current purposes, right? Yup, exactly, exactly. Agreed, I’ll have to delete it. Once it’s done, I’ve got to delete that, yeah. So we’ll have ways to reload it for calculations when needed.

[47:41] That’s why we’ve got these methods to keep it safe. I’ve got a question over here. I don’t know if Thỏ’s crew knows about this, can we compare it? Standing out is Timescale, it’s got a chunk-moving mechanism. For example, we compress it like normal. Plus, there’s this thing about having hot, warm, and cold storage. Like, if we back up weekly, it goes on Azure’s hot storage. If it’s too old, say 2, 3, 4, 5 years, then it’s on Azure’s cold storage or Backblaze. It’s got a separate service for us to move that data stuff.

[48:53] Right to the object or block storage spot, we interact with Timescale to make sure we save money with old data when we need to. It can save costs, it can still query, with the trade-off being that querying is a bit slow with older data. Yup, what I get is it depends on the platform we use to build, right, bro? For example, with Microsoft, it’ll depend on the database’s timing or the data’s lifespan or capacity and stuff, so there’ll be different levels.

[49:40] For example, there’ll be delays, or normal access still, or delays for stuff that hasn’t been used in a long time. That’s for us to specify on each tool. But generally, it’s like, bro, what’s this doing? Standing out is Timescale, it fits this kind of pattern, for time series stuff. On Azure’s side, they make it fit the status, kinda like Timescale, but it helps us partition and shard the way we want.

[50:39] Each one has its own pros and cons. With AWS, standing out is that with this service, you’ve got to watch the hardware storing this data, whether it’s stable or not. For example, Azure’s cold storage uses disks, what kind of disks, some pretty unique ones. They’ve got to use a laser machine to burn it in there. So querying is super fast, but inserting is kinda slow, like inserting a bunch takes a few minutes. Because it needs a hard laser to burn it on there, no virtualization layer.

[51:23] Each service and each type of tool we use for compressing and storing has its own pros and cons, depending on the platform we subscribe to. Yup, exactly. This isn’t just about tools like AWS or Google services. It’s like we can also weigh it for our business too. So this is kinda general. Each platform uses different techniques. The ultimate common goal is to tackle the issue of data growing big but affecting how we query, how it runs.

[52:16] Lots of ways to solve the query optimization problem, right? When the issue is that the data’s too big, there are a few approaches. Biên’s way is one approach, meaning there’s a chunk of data we’re not using, so we cut it out, store it somewhere. Later, if we need past data, we insert it back to use it. For now, we keep some percentage of current data, enough for current purposes, so querying is faster.

[52:59] Other ways use tooling, some database types, like Timescale, optimize querying for huge data right off the bat. I think underneath, it kinda auto-buffs it somewhere, holds it for us, right, bro? So we fuss over the details underneath, we just use the interface. Underneath, it’s pretty much the same, like us kids. Thanks, Biên, that’s it. Probably An’s last piece, not sure if it’s related. Not sure if it ties a bit to the community stuff.

[54:00] If there is, it’s probably just a quick rundown, not a lot. Pretty similar to Biên’s piece, but the use case is kinda close too. It expands a bit more. Alright, so this piece is a quick talk about datalakes and Notion’s use case. Let’s talk datalakes first. Datalakes, you guys have probably heard about them tons, been around for a while now. Let’s look back at how these datalakes evolved, see where we’re at.

[54:54] Actually, from the start, around 1980 or something, it was the era of databases, database warehouses, the stuff we’re using now. Table stuff, creating tables and processing data. Later, around the 2000s or so, the big tech folks started collecting tons of data. They used that data, so new stuff popped up to handle storing and processing data on big datasets. Like data stored as files and such. These things, terms like MapReduce, for instance.

[55:44] I think in my memo, there’s an article on MapReduce. If you guys don’t know, you can search it up, check it out to see what MapReduce did back then. It was the ancestor of this era. Later, it just got integrated in, not used standalone anymore, but it’s probably all built-in now. After that development phase, around 2010, it started giving birth, a bit before 2010, to concepts like datalakes, big data, cloud, internal data warehouses on the cloud. It’s just cloud stuff.

[56:28] Now, these days, it’s evolving further, into lakes and datamarts. Lakes are probably just a mix of datalake stuff and warehouses, then turned into a house. Like Datadog or whatever they’re doing, I don’t know, but we’re probably talking about this a bit behind the times. To focus in, let’s take a quick look at a general data architecture first. This one, Tom’s piece the other day posted it, had a diagram too. It’s a bit more streamlined than this, a bit more concise, about data going through layers, processing then to some other thing.

[57:20] This one shows it a bit clearer, about what kinds of data we store in a datalake. Compared to a data warehouse, we only store structured data, or stuff like table data that’s all cleaned up. But this datalake, it’s raw data, it’ll handle structured, unstructured, semi-structured data all together. It stores it raw, then it processes the data, transforms it, and tosses it over to the BI analytics crew or into another warehouse to hold the processed data.

[58:18] Then there’s this analytics sandbox layer, which is a layer for data scientists or folks who need to use raw data, process data, without messing with the main process. Over here, they’ll work on this sandbox to handle data for those guys, the ones who need raw data but don’t directly affect the main flow. It’s like what Biên said earlier, doing that stuff, taking data and storing it somewhere for later use or to process something, I don’t know, but it doesn’t want to mess with the app’s main process, so it’s this pile.

[59:09] Here, you guys see we’ve got this concept called ETL, extract, transform, and load. With warehouses, what we’ve done so far is extract, transform, and load, it follows that order straight up. But in this one, you’ll clearly see the datalake does extract and load first. Then when it’s needed, it starts processing the data, that’s transform. Transform comes after, load comes before. That’s the difference between the two.

[59:52] This is just the spot comparing the differences between data warehouses and datalakes. With warehouses, the data is cleaned, structured, organized into tables. But this one stores it as files, raw data and stuff, semi-structured already, CSV or JSON files. The processing is different between this lake and that lake too. Querying, the warehouse uses SQL, while over there, it processes directly on the data itself. Tools that support direct data processing, like Spark, handle that stuff. Moving on to Notion.

[01:00:46] For datalakes, the use case of Notion, you guys know we’ve been using Notion quite a bit already. Back in the day, it started slow. The organizations, the blocks from before, they were organized like normal data, just like us, small apps. Its blocks started growing gradually. Blocks are understood as what, and they’ll include the title in there. They call it a block. The number of blocks keeps increasing constantly by the day and hour.

[01:01:35] Something like that, then later it started swelling up, and it began using techniques like sharding, old-school sharding. Like I remember Hải Vũ’s article mentioning something about it, scaling horizontally. It started splitting into sharding and stuff, then instances. From 2021 to 2023, it had 32 instances. Each instance had 15 shards. Then from 2023 onward, it started splitting again, increasing even more. The number went up again, so that’s 96 instances. And each instance had 5 shards. Multiply that, it’s around 400-something, four hundred and some.

[01:02:27] To handle that, at this point it’s pretty big, right? When the data starts getting big, it’ll have some needs. Later on, it’ll have needs for analytics or stuff related to machine learning, datasets and tricks and all that. It started setting up its own data warehouse architecture. This was the precursor before setting up the datalake. It did a data warehouse to process data. The basic flow it set up was to collect data, stuff about data changes in the blocks across each shard.

[01:03:21] It used file transfers to ingest the data from these shards here. It dumped it into something, then merged those things into one big single database. This ran into issues because, like I said earlier, it’s got about four hundred-something shards, right? It struggled with managing four hundred-something connections for this thing. Plus the scaling challenges. The amount of data changing in each block of Notion happens often and is super heavy, so it made reading and writing in this big table tough.

[01:04:13] After that, it started setting up its own internal datalake. This internal datalake has a note that it won’t completely replace this one, it just uses the new stuff. The old one, it still uses for some tasks, lighter ones, for tables where data changes aren’t too heavy. And it needs something. But this one, it expects this flow to tag the data it needs for purposes like analytics or machine learning.

[01:05:08] The data can handle a delay of a few hours, a few minutes, something like that. It’ll use the data in here. The setup amount is pretty simple. It uses this thing, Debezium CDC, you know. It’s the capture data change thing, to watch this database and shoot it over to Kafka. After it shoots that pile of event data changes to Kafka, there’s a thing over here, Hudi or something, that grabs those events and tosses them to S3. Then from this point, whoever wants to use it goes in here, grabs it, sets it up further, uses it for data warehouses or some shard-related purposes, they take it from here and use it.

[01:05:51] That’s actually the Notion case. We could probably try using this thing. Because it’s also one of those that stands outside, watching that pile. If we use AWS or retraining, it’d use something like Redshift or whatever, I forget. It’d watch that, the changes on the database, then save it all into a bucket or something. From there, we start processing afterward. This flow here could use that. Earlier, I set up a demo, but it kinda flopped.

[01:06:51] Because it didn’t have the server yet, so it failed. Let’s leave it for later, probably just that much for now. Plus there’s this perspective, this process here. It’s a process that enterprise folks could probably apply. It’s a kinda general process that most enterprises later on, I think, might use. Their needs, when the data grows big, will probably head in this direction anyway. It’s that they need data, collect data to do something, without messing with the main flow.

[01:07:52] For us, so far we’ve always focused on working with AI models. But I think later on, we’ll also need some skill set to know how to handle data like this, stuff where the data’s bigger, you know. Sorry, which bro is this? You’re looking at this process, how’s it different from us replicating our database to another instance for retraining purposes, bro? Because here, standing out, the point is, it’s like I’m kinda generating data to another different shard, right?

[01:08:54] And using an upload kit process for tasks that don’t, like, we can do async, you know, instead of needing to work directly on the main data source. The question is, for all those models like sharding or using master-slave setups, why not just duplicate our database? Duplicating data, it’s still just a data warehouse in table form, yeah. But actually this, it’s just a process, meaning a process for the database.

[01:09:40] It could have other events. Like, for example, we’ll have lots of external data, not just one battery, a database, you know. Say we’ve got captures from social media or some random messy stuff, who knows. But it could be lots of different data types, gathered up, tossed to this thing. This Hudi buddy here, it’s the one responsible for processing that raw data, to throw it into this S3 thing. It stores everything under this pile.

[01:10:23] It goes right in, everything in some file format, all dumped in here, so later on, the outside folks have a slot to process it. Actually, they had a question too, why not use databases like MySQL or PostgreSQL? They’ve got their own... Why use this capture data change thing instead of those? Those ones, they’ve got mechanisms to stream their event changes already. With them, event streams usually stream straight from one database to another.

[01:11:09] But this one, it captures that event and sends it wherever. Because if we don’t have this Kafka here, we’d need some service, we’d need real-time data processed right away, without going through Kafka. This CDC thing can still bypass to there, kinda like that, not exactly from database to database like that. We also noticed something, like, it feels like, from an operations angle, for instance. Of course, if there are multiple datasources and we use partition tools and stuff, they’re different.

[01:11:50] Different databases, so this will also, actually, bundle it into a datalake, so some tasks are specific, you know. Actually, some teams, like the AI team or the reporting team or data folks, they’d probably just need to work on this data warehouse, like that. Or if it extends to other sides too, it’d make sense, splitting data zones for each team separately, right? They added this thing about the button, the ETL button in regular databases versus datalakes, it’d be ELT, right?

[01:12:39] Yup, ELT, you’d get it as extract, meaning we find the right file, right? We load that file up here and transform it into something like structured data, yeah. The idea is it transforms, it’s just an action, it happens after we’ve got the raw data. But ETL means extract is pulling data from the data source pile. Then it’s got a process to log straight into the raw data, straight into some stuff of ours. They call it raw landing, the raw landing layer. Then we’ve got what’s called transform.

[01:13:34] After that, after the landing, there’s transform to process the data, so it comes out later. But the other one extracts, then transforms, then tosses it straight into the warehouse, that’s our database. Any questions, guys? Alright, thanks An, yup. See you, guys, bye, have a fun week

Properties

Location

Stats

OGIF office hours #37 - AI fine-tuning, data archiving, and datalake scaling with Notion

Topics and highlights

Vietnamese transcript

English transcript

Subscribe to Dwarves Memo