Pham Thanh

thanhpn

💬 Senior Full-stack and Blockchain Engineering

Pinned memos

Proximal policy optimization (PPO) is an algorithm that improves training stability by avoiding overly large policy updates. It is a popular and effective method for training reinforcement learning models in complex environments. To achieve this, PPO computes the probability ratio between the current policy and the old policy and clips this ratio to a specific range, so that each policy update stays small and training remains stable...
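As a rough illustration of the clipping idea described in this excerpt, here is a minimal per-sample sketch of the clipped surrogate objective. The function name `clipped_surrogate`, its parameters, and the default `clip_eps=0.2` are assumptions made for this example and are not taken from the memo.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (a value to maximize).

    logp_new / logp_old: log-probability of the taken action under the
    current and the old policy. clip_eps is an assumed clipping range.
    """
    # Probability ratio between the current policy and the old policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A ratio of about 3 with a positive advantage is capped at 1 + clip_eps.
capped = clipped_surrogate(math.log(0.9), math.log(0.3), advantage=1.0)
```

Taking the minimum of the two terms means a large ratio earns no extra credit when the advantage is positive, which is what removes the incentive for an oversized update.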

A reward model is a critical component in reinforcement learning for Large Language Models (LLMs), designed to evaluate and score the quality of generated responses. It plays a key role in aligning LLMs with human values and improving their output through iterative refinement.

An introduction to Q-learning, a model-free reinforcement learning algorithm used to learn optimal policies in Markov Decision Processes.
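For readers unfamiliar with the algorithm, a single tabular Q-learning update can be sketched as follows. The helper `q_update`, the state and action labels, and the default `alpha` and `gamma` values are hypothetical choices for this illustration, not code from the memo.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q-value.

    q maps (state, action) pairs to estimated returns; alpha is the
    learning rate and gamma the discount factor (assumed defaults).
    """
    # Model-free: only the observed transition is used, no dynamics model.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the TD target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Unseen (state, action) pairs start at 0.0.
q_table = defaultdict(float)
first = q_update(q_table, "s0", "right", 1.0, "s1", ["left", "right"],
                 alpha=0.5, gamma=0.9)
```

Because the target uses the maximum over next actions rather than the action actually taken next, the update learns about the greedy policy regardless of how the agent explores.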

July 2024
June 2023
Published Reward model, June 23
Published Q learning, June 22
May 2023
February 2023
Published Plonky2, February 28
January 2023
Published Polygon zkEVM architecture, January 03
December 2022
Published StarkNet architecture, December 26
August 2022
Published Multisign wallet, August 10
July 2022
Published Anchor framework, July 01
June 2022
Published Blockchain bridge, June 21
Dwarves Foundation
Memo