Showing 21–40 of 6090 insights
| Title | Episode | Published | Category | Domain | Tool Type | Preview |
|---|---|---|---|---|---|---|
| Evaluations as Core | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Opinions | Ai-development | - | The real power of LangChain (or LangSmith) lies in its robust evaluation and experimentation capabilities, which often go overlooked amid other features... |
| Comparative Experimentation | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Use an LLM as a judge to compare outputs from multiple experiments or models, facilitating side-by-side evaluation and selection of the best performing... |
| LLM Feedback Loop | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Integrate Claude Code with LangSmith via MCP to run experiments, inspect traces, modify code, and rerun in a tight feedback loop for rapid AI development... |
| Building Evaluator Hooks | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Use evaluator hooks on GitHub to automatically run AI evaluations on each commit, enabling continuous feedback on prompt performance. |
| OpenAI Playground & SDK | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Products | Ai-development | Ai-service | Interfaces and SDKs provided by OpenAI to run prompt experiments, evaluate outputs, and integrate LLM evaluations into workflows. |
| Anthropic Haiku Model | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Products | Ai-development | Ai-service | Anthropic’s new Haiku language model, suggested for comparative experiments against other LLMs like GPT-5 mini. |
| GPT-5 Mini Model | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Products | Ai-development | Ai-service | An experimental miniature version of GPT-5 usable in the Playground, Studio links, or via SDK for prompt performance comparisons. |
| LLM Arenas Pairwise | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Products | Ai-development | Ai-service | A tool (aka LLM arenas) for conducting pairwise comparisons between model responses, enabling automated choice of better outputs. |
| OpenAI LLM Evaluation | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Products | Ai-development | Ai-service | OpenAI’s API-based evaluation framework offering out-of-the-box LLM judge evaluations to score and compare model outputs automatically. |
| ChatGPT Prompt Dataset | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Products | Ai-development | Ai-service | A public Hugging Face dataset containing ChatGPT-style prompts (e.g., “act as an Ethereum developer”) for testing and evaluating language model outputs. |
| Continuous Experimentation Mindset | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Opinions | Ai-development | - | Adopt an iterative, experiment-driven approach to prompt engineering and model evaluation to systematically improve performance and gather rapid feedback... |
| Context Window Data Formatting | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Implement precise data passing into the model’s context window, including formatting, conversions, and calculations, to improve prompt reliability. |
| Curated Prompt Test Suite | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Maintain a curated dataset of prompts, expected outputs, and evaluation criteria to continuously test model upgrades and prompt engineering workflows. |
| Pairwise Output Comparison | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Set up pairwise experiments where an LLM compares two nondeterministic outputs and declares which is better to guide prompt improvements. |
| LLM-Based Output Evaluation | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Use OpenAI’s evaluation framework to have an LLM judge outputs and assign a numeric score (e.g., 1–100) as an automated quality metric. |
| Prompt Dataset Evaluation | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Leverage Hugging Face’s ChatGPT prompt dataset to run experiments on diverse prompt examples and build automated evaluators. |
| Agent Engineering as Software Testing | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Opinions | Ai-development | - | View agent engineering through the same lens as traditional software development by applying known testing paradigms for maintainability and clarity. |
| Multi-Metric Optimization | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Optimize AI outputs not only for accuracy but also for cost or other business metrics by iterating on prompts and workflows against reference datasets... |
| Iterative Benchmarking Workflow | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Iteratively test different app versions, prompts, data formats, or AI models against a gold standard dataset to benchmark accuracy and optimize outcomes... |
| Unit and Regression Testing | EP 21 Kimi k2 Thinking, The AI Bubble, Nvidia’s Future, and LangChain Experiments | 11/22/2025 | Frameworks | Ai-development | - | Leverage traditional software testing frameworks like unit and regression tests to systematically evaluate AI agent performance using reference inputs... |
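
Several of the entries above (Comparative Experimentation, LLM Arenas Pairwise, Pairwise Output Comparison, LLM-Based Output Evaluation) describe using an LLM as a judge. The sketch below shows one minimal way to do that with the OpenAI Python SDK: a judge model picks the better of two candidate answers and assigns each a 1–100 score. The judge model name, prompt wording, and JSON fields are illustrative assumptions, not anything prescribed in the episode.

```python
# A minimal LLM-as-judge sketch: pairwise comparison plus a 1-100 score per answer.
# Assumes the official `openai` Python SDK and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_MODEL = "gpt-4o-mini"  # assumed judge model; swap for whatever you use


def judge_pairwise(prompt: str, output_a: str, output_b: str) -> dict:
    """Ask a judge model which of two candidate outputs answers `prompt` better."""
    judge_prompt = (
        "You are an impartial judge. Given a task prompt and two candidate answers, "
        "decide which answer is better and give a 1-100 quality score for each.\n"
        'Respond with JSON: {"winner": "A" or "B", "score_a": int, "score_b": int, "reason": str}.\n\n'
        f"Task prompt:\n{prompt}\n\nAnswer A:\n{output_a}\n\nAnswer B:\n{output_b}"
    )
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},  # force a JSON verdict
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    verdict = judge_pairwise(
        "Explain what a context window is in one sentence.",
        "A context window is the maximum amount of text a model can attend to at once.",
        "It is a window.",
    )
    print(verdict["winner"], verdict["score_a"], verdict["score_b"])
```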
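
The ChatGPT Prompt Dataset and Prompt Dataset Evaluation entries point at a public Hugging Face prompt collection. Below is a rough sketch of pulling prompts from it and generating candidate outputs from two models so the pairs can be fed to a judge like the one above. The dataset id `fka/awesome-chatgpt-prompts`, its `act`/`prompt` columns, and the model names are assumptions inferred from the “act as an Ethereum developer” example; verify them before use.

```python
# Sketch: run the same prompts through two models to produce outputs for later judging.
# Dataset id and column names are assumptions, not confirmed by the episode.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()


def run_experiment(model: str, limit: int = 5) -> list[dict]:
    """Generate completions for the first `limit` prompts in the dataset."""
    ds = load_dataset("fka/awesome-chatgpt-prompts", split="train")  # assumed dataset id
    results = []
    for row in ds.select(range(limit)):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": row["prompt"]}],
        )
        results.append({
            "persona": row["act"],  # e.g. "Ethereum Developer" (assumed column name)
            "prompt": row["prompt"],
            "output": completion.choices[0].message.content,
        })
    return results


if __name__ == "__main__":
    # Compare two models on the same prompts; pair up results and judge them.
    baseline = run_experiment("gpt-4o-mini")   # assumed model names
    candidate = run_experiment("gpt-4o")
```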
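
Finally, the Curated Prompt Test Suite, Building Evaluator Hooks, and Unit and Regression Testing entries all amount to treating prompts like code under test. A hedged pytest sketch follows: a small gold dataset of prompts with required facts, runnable on every commit from CI (for example, a workflow step that simply runs `pytest`). The file path, pass criterion, and `call_model` helper are placeholders for whatever your application actually exposes.

```python
# Sketch of a prompt regression suite: each gold case lists a prompt and strings the
# output must contain. File layout and pass criterion are illustrative assumptions.
import json
import pathlib

import pytest
from openai import OpenAI

client = OpenAI()
GOLD_PATH = pathlib.Path("evals/gold_dataset.jsonl")  # assumed location of the curated suite


def call_model(prompt: str) -> str:
    """The application entry point under test; a direct model call stands in for it here."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


def load_gold_cases() -> list[dict]:
    """Each JSONL line: {"prompt": ..., "must_contain": [...]} describing expected output."""
    return [json.loads(line) for line in GOLD_PATH.read_text().splitlines() if line.strip()]


@pytest.mark.parametrize("case", load_gold_cases())
def test_prompt_regression(case: dict) -> None:
    output = call_model(case["prompt"]).lower()
    for required in case["must_contain"]:
        assert required.lower() in output, f"missing {required!r} for prompt {case['prompt']!r}"
```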