<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.k4nul.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.k4nul.com/" rel="alternate" type="text/html" /><updated>2026-04-19T04:23:01+09:00</updated><id>https://www.k4nul.com/feed.xml</id><title type="html">K4NUL</title><subtitle>K4NUL&apos;s Blog.</subtitle><author><name>K4nul</name></author><entry xml:lang="en"><title type="html">Harness Engineering 10. From Documents to an Observable Harness</title><link href="https://www.k4nul.com/en/ai/from-document-centered-ops-to-an-observable-harness/" rel="alternate" type="text/html" title="Harness Engineering 10. From Documents to an Observable Harness" /><published>2026-04-19T00:00:00+09:00</published><updated>2026-04-19T00:00:00+09:00</updated><id>https://www.k4nul.com/en/ai/from-document-centered-ops-to-observable-harness-en</id><content type="html" xml:base="https://www.k4nul.com/en/ai/from-document-centered-ops-to-an-observable-harness/"><![CDATA[<h2 id="summary">Summary</h2>

<p>Writing longer documents is not the solution. That is where the entire series eventually lands. Files like <code class="language-plaintext highlighter-rouge">AGENTS.md</code> or <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> can be useful starting points, but making them denser does not magically create a harness. In this final post, I want to gather the discussion into a practical roadmap for moving from document-centered operations to a harness that is observable and verifiable.</p>

<h2 id="verification-scope-and-interpretation-boundary">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026. I checked the OpenAI Codex docs, the OpenAI platform docs, and the Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored engineering posts only when a concept is introduced there.</li>
  <li>Fact boundary: only documented features such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, evals, and trace are treated as factual claims.</li>
  <li>Interpretation boundary: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating abstractions used in this series unless a source line says otherwise.</li>
</ul>

<h2 id="if-you-compress-the-problem-into-one-sentence">If You Compress the Problem into One Sentence</h2>

<p>The real issue was never that there were too few documents. It was that too much of the harness still lived only inside documents. When good principles remain only as prose, they are easy to miss. Handoffs drift, eval looks only at artifacts, and trace and guardrails remain too thin to stabilize the system.
Interpretation: this section is my own one-sentence summary of the series. The factual anchor is that the official docs already separate instruction files, hooks, eval, trace, and approvals into distinct operating layers (<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>).</p>

<h2 id="core-principles-of-the-transition">Core Principles of the Transition</h2>

<p>The direction of change is actually straightforward. Responsibilities that were overloaded into documentation have to be redistributed across multiple layers.
Documented fact: the official docs already expose instruction, automation/hooks, evaluation, trace, and approval or permission as separate layers (<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>).
Interpretation: the transition principles in this post are my recommendation for turning those layers into an operating design.</p>

<ol>
  <li>Keep the project instruction file light and treat it as an entrypoint.</li>
  <li>Split detailed operating documents into separate locations.</li>
  <li>Structure handoff as schema.</li>
  <li>Add agent workflow eval alongside product tests.</li>
  <li>Lift repeated document rules into CI, hooks, or validation.</li>
  <li>Preserve trace, not just outputs.</li>
  <li>Define approval boundaries and guardrails.</li>
</ol>

<p>If a team keeps solving problems by adding more prose, the result is more sentences like “update the docs,” “leave a handoff,” or “respect ownership.” If the team instead separates roles and adds automatic checks, the instruction file gets shorter while outer layers emerge: docs freshness checks, handoff schema validation, changed-path ownership checks, and trace review. The difference is not document length. It is where the harness actually lives.</p>
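<p>To make handoff-as-schema concrete, the check can start very small. The sketch below is my own illustration, not from any official doc: it assumes handoff notes written as simple <code class="language-plaintext highlighter-rouge">key: value</code> lines, and the field names <code class="language-plaintext highlighter-rouge">owner</code>, <code class="language-plaintext highlighter-rouge">change</code>, and <code class="language-plaintext highlighter-rouge">verification</code> are placeholder choices. It reports required handoff fields that are missing.</p>

```rust
// Minimal sketch of a handoff schema check. It treats a handoff note as
// "key: value" lines and reports required fields that are absent.
// The field names ("owner", "change", "verification") are illustrative.
fn missing_fields<'a>(handoff: &str, required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .filter(|field| {
            let prefix = format!("{}:", field);
            // A field counts as present if any line starts with "field:".
            !handoff
                .lines()
                .any(|line| line.trim_start().starts_with(&prefix))
        })
        .copied()
        .collect()
}

fn main() {
    let handoff = "owner: team-a\nchange: refactor the parser\n";
    let missing = missing_fields(handoff, &["owner", "change", "verification"]);
    assert_eq!(missing, vec!["verification"]);
    println!("missing handoff fields: {:?}", missing);
}
```

<p>A check like this can run as a hook or CI step before a handoff is accepted, which is exactly the move from prose rule to validation described above.</p>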

<p>Taken together, the official documentation and case studies point to a broadly similar transition order across tools. Whether the example is Codex, Claude Code, or another agentic coding environment, the path still moves toward a lighter entrypoint and stronger outer layers for schema, enforcement, eval, trace, and guardrails.</p>

<h2 id="a-realistic-order-of-adoption">A Realistic Order of Adoption</h2>

<p>In practice, this is easier to do as a gradual reconstruction than as one massive rewrite.
Interpretation: the adoption order here is my staged rollout recommendation. The official docs do not prescribe this exact sequence.</p>

<ol>
  <li>Lighten the project instruction file first. Leave only core expectations and reference paths.</li>
  <li>Split detailed operating procedures into separate documents so explanation and system rules stop competing in the same place.</li>
  <li>Introduce a handoff schema, even if it starts with only a few required fields.</li>
  <li>Add at least one agent workflow eval such as wrong-owner routing, handoff completeness, or stale-doc detection.</li>
  <li>Move repeatedly missed rules into CI or hook-based enforcement.</li>
  <li>Define approval boundaries and guardrails as part of the execution model rather than as an afterthought.</li>
</ol>
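<p>For step 5, enforcement can begin as a one-file check. The sketch below is my own illustration, assuming a 100-line budget and an <code class="language-plaintext highlighter-rouge">AGENTS.md</code> path that I chose for the example; nothing in the official docs prescribes either. It fails a CI run when the instruction file outgrows its entrypoint role.</p>

```rust
use std::fs;

// Sketch: turn the prose rule "keep the instruction file light" into a
// check that CI can run. The 100-line budget and the AGENTS.md path are
// illustrative assumptions, not from any official doc.
fn over_budget(contents: &str, budget: usize) -> bool {
    contents.lines().count() > budget
}

fn main() {
    let budget = 100;
    // A missing file is treated as empty, i.e. trivially within budget.
    let contents = fs::read_to_string("AGENTS.md").unwrap_or_default();
    if over_budget(&contents, budget) {
        eprintln!("AGENTS.md exceeds the {}-line budget", budget);
        std::process::exit(1);
    }
    println!("AGENTS.md is within the {}-line budget", budget);
}
```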

<h2 id="how-small-projects-can-start">How Small Projects Can Start</h2>

<p>Small projects can begin even more simply. A practical four-week transition plan might look like this.
Interpretation: the four-week plan is my proposal. The factual basis is simply that instruction, hooks, eval, trace, and approvals already exist as separate documented layers (<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>).</p>

<ol>
  <li>Week 1: cut the instruction file in half and move detailed procedures into separate docs.</li>
  <li>Week 2: introduce one handoff schema and one docs freshness check.</li>
  <li>Week 3: add one agent workflow eval such as wrong-owner routing or stale-doc detection.</li>
  <li>Week 4: define a minimal set of trace fields and approval boundaries.</li>
</ol>

<p>The point is not to build a perfect system immediately. Smaller projects are often more tempted to pile everything into one control-plane document. In reality, they benefit even more from starting with a few small schemas, checks, and trace fields.</p>
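<p>As one illustration of what &ldquo;a few trace fields&rdquo; might mean in week 4, a minimal per-task trace record can be sketched like this. Every field name here is my own assumption, not a documented format; the point is only that a handful of structured fields already makes a run reviewable after the fact.</p>

```rust
// Sketch of a minimal per-task trace record. All field names are
// illustrative; real deployments would pick fields that match their
// own review questions.
#[derive(Debug)]
struct TraceRecord {
    task_id: String,
    tool_calls: u32,
    files_touched: Vec<String>,
    approval_requested: bool,
}

fn main() {
    let record = TraceRecord {
        task_id: "task-042".to_string(),
        tool_calls: 7,
        files_touched: vec!["src/parser.rs".to_string()],
        approval_requested: false,
    };
    // Even a Debug dump is enough to start reviewing traces.
    println!("{:?}", record);
}
```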

<h2 id="how-i-used-to-see-the-problem-and-how-i-see-it-now">How I Used to See the Problem, and How I See It Now</h2>

<p>At first, I also tried to solve the problem by writing a better <code class="language-plaintext highlighter-rouge">AGENTS.md</code>. It seemed like better rules, more detailed role definitions, and more careful handoff and documentation guidance should make the system more stable. Over time, though, it became much clearer that the core issue was not adding more documentation, but redesigning the operating structure itself.</p>

<p>That shift is less about reflection and more about design. Writing documents still matters, but a harness only becomes a system once schema, enforcement, eval, trace, and guardrails enter the picture.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>The main argument of the series is simple. Agent-system quality does not become stable because of a single prompt or a longer operating document. Instruction files should become lightweight entrypoints, while detailed operating behavior becomes testable and observable through schema, eval, enforcement, trace, and guardrails. The shift from document-centered operations to an observable harness is less a massive one-time redesign than a gradual reconstruction that keeps moving repeated failure points out of prose and into outer layers.</p>

<p>If I had to leave one final sentence, it would be this: do not start by writing more documentation; start by deciding what needs to become visible and checkable outside the document.</p>

<h2 id="one-sentence-summary-of-the-series">One-Sentence Summary of the Series</h2>

<p>Harness engineering is not about writing more guidance; it is about moving operating principles out of prose and into schema, eval, enforcement, trace, and guardrails so the system becomes observable and verifiable.</p>

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>The best place to start is not by extending the instruction file, but by checking whether that file is carrying too many responsibilities already. After that, pick one rule that is repeatedly missed and turn it into a schema or a check. Even one structured handoff, one docs freshness check, or a few trace fields can noticeably change how stable the system feels. In a small project, it is usually better to move one shaky area into an outer layer than to attempt a total redesign. In that sense, the end of this series is closer to a starting line than a conclusion. What matters now is not more documentation, but an operating structure that is easier to observe and easier to verify.</p>

<h2 id="sources-and-references">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/guides/agents-md">Custom instructions with AGENTS.md</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/hooks">Hooks</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="roadmap" /><category term="eval" /><category term="trace" /><category term="guardrail" /><summary type="html"><![CDATA[A practical roadmap for moving from document-centered operations to an observable harness.]]></summary></entry><entry xml:lang="ko"><title type="html">하네스 엔지니어링 10. 문서 중심 운영에서 관측 가능한 하네스로: 실전 전환 로드맵</title><link href="https://www.k4nul.com/ai/from-document-centered-ops-to-observable-harness/" rel="alternate" type="text/html" title="하네스 엔지니어링 10. 문서 중심 운영에서 관측 가능한 하네스로: 실전 전환 로드맵" /><published>2026-04-19T00:00:00+09:00</published><updated>2026-04-19T00:00:00+09:00</updated><id>https://www.k4nul.com/ai/from-document-centered-ops-to-observable-harness</id><content type="html" xml:base="https://www.k4nul.com/ai/from-document-centered-ops-to-observable-harness/"><![CDATA[<h2 id="요약">요약</h2>

<p>문서를 더 길게 쓰는 것만으로는 해결되지 않는다. 시리즈 전체를 관통한 문제의식도 결국 여기로 모인다. AGENTS.md나 CLAUDE.md 같은 지침 파일을 더 촘촘히 쓰는 일은 출발점일 수는 있지만, 그 자체가 하네스를 만들어 주지는 않는다. 이번 최종편에서는 지금까지의 논의를 묶어, 문서 중심 운영에서 관측 가능하고 검증 가능한 하네스로 실제로 어떻게 전환할 수 있는지 정리해보려 한다.</p>

<h2 id="검증-기준과-해석-경계">검증 기준과 해석 경계</h2>

<ul>
  <li>시점: 2026-04-15 기준 OpenAI Codex 문서, OpenAI 공식 API 문서, Anthropic Claude Code 문서를 확인했다.</li>
  <li>출처 등급: 공식 문서 우선, 개념을 처음 소개한 vendor-authored 글만 보조로 사용했다.</li>
  <li>사실 범위: <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, eval, trace처럼 문서로 확인 가능한 기능만 사실로 적었다.</li>
  <li>해석 범위: <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, <code class="language-plaintext highlighter-rouge">observable harness</code> 같은 표현은 이 시리즈에서 쓰는 운영 개념이며, 별도 근거 줄이 없는 경우 필자의 해석이다.</li>
</ul>

<h2 id="지금까지의-문제를-한-문장으로-다시-정리하면">지금까지의 문제를 한 문장으로 다시 정리하면</h2>

<p>문제가 된 것은 문서가 적어서가 아니라, 하네스가 문서에만 머물러 있었다는 점이다. 좋은 원칙이 자연어로만 남아 있으면 놓치기 쉽고, handoff는 흩어지고, eval은 결과물만 보게 되고, trace와 guardrail이 비면 시스템은 겉보기보다 훨씬 불안정해진다.
해석: 이 절은 시리즈 전체를 한 문장으로 요약하는 필자의 관점이다. 다만 공식 문서들을 보면 instruction file, hooks, eval, trace, approvals가 실제로 분리된 운영 계층으로 존재한다(<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)</p>

<h2 id="전환의-핵심-원칙">전환의 핵심 원칙</h2>

<p>전환의 방향은 복잡하지 않다. 문서에 몰려 있던 역할을 여러 층으로 다시 나누면 된다.
문서로 확인 가능한 사실: 공식 문서에는 이미 instruction, automation/hooks, evaluation, trace, approval/permission이 분리된 계층으로 등장한다(<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)
해석: 이 글의 전환 원칙은 그 계층들을 운영 구조로 재배치하는 필자의 권고다.</p>

<ol>
  <li>프로젝트 지침 파일은 entrypoint로 경량화한다.</li>
  <li>세부 운영 문서는 별도 경로로 분리한다.</li>
  <li>handoff는 schema로 구조화한다.</li>
  <li>제품 테스트와 별도로 agent workflow eval을 둔다.</li>
  <li>문서 규칙은 CI, hook, validation 같은 enforcement로 올린다.</li>
  <li>결과뿐 아니라 trace를 남긴다.</li>
  <li>approval boundary와 guardrail을 함께 설계한다.</li>
</ol>

<p>예를 들어 문제를 문서만 계속 추가하는 방식으로 풀려 하면, “문서를 갱신해라”, “handoff를 남겨라”, “owner를 지켜라” 같은 문장이 늘어난다. 반대로 역할을 분리하고 자동 검증을 붙이는 방식으로 가면 지침 파일은 짧아지고, docs freshness check, handoff schema validation, changed-path ownership check, trace review 같은 바깥 층이 생긴다. 차이는 문서의 분량이 아니라, 하네스가 어디에 존재하느냐다.</p>

<p>공식 문서와 사례를 종합해보면, 도구는 달라도 전환 순서는 크게 비슷하다. Codex든 Claude Code든 결국 entrypoint를 가볍게 하고, schema, enforcement, eval, trace, guardrail을 바깥 레이어로 세우는 방향으로 가게 된다.</p>

<h2 id="현실적인-적용-순서">현실적인 적용 순서</h2>

<p>현실에서는 한 번에 다 바꾸기보다 점진적으로 재구성하는 편이 낫다.
해석: 어떤 순서로 옮길지는 필자의 단계적 도입안이다. 공식 문서가 특정 순서를 강제하는 것은 아니다.</p>

<ol>
  <li>프로젝트 지침 파일을 먼저 경량화한다. 저장소의 핵심 기대치와 참조 경로만 남긴다.</li>
  <li>운영 절차와 기준은 별도 문서로 분리한다. 사람을 위한 설명과 시스템 규칙을 섞지 않는다.</li>
  <li>handoff schema를 도입한다. 최소 필드부터 고정해 누락을 줄인다.</li>
  <li>agent workflow eval을 추가한다. wrong-owner routing, handoff completeness, stale docs 같은 실패를 별도로 본다.</li>
  <li>반복적으로 놓치는 규칙을 CI나 hook으로 올린다. prose를 enforcement로 바꾸는 단계다.</li>
  <li>approval boundary와 guardrail을 명시한다. 위험한 실행 경계를 뒤늦게 붙이지 말고 구조 안에 넣는다.</li>
</ol>

<h2 id="작은-프로젝트에서는-어떻게-시작할-수-있는가">작은 프로젝트에서는 어떻게 시작할 수 있는가</h2>

<p>작은 프로젝트라면 더 단순하게 시작해도 된다. 예를 들어 4주 전환안은 아래처럼 잡을 수 있다.
해석: 이 절의 4주 계획은 필자의 제안이다. 다만 출발점으로 instruction, hooks, eval, trace, approvals 같은 분리된 계층을 확인할 수 있다는 점은 공식 문서로 뒷받침된다(<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>)</p>

<ol>
  <li>1주차: 지침 파일을 반으로 줄이고, 세부 문서를 별도 폴더로 뺀다.</li>
  <li>2주차: handoff schema 하나와 docs freshness check 하나를 도입한다.</li>
  <li>3주차: wrong-owner routing 또는 stale-doc detection 같은 agent workflow eval 하나를 붙인다.</li>
  <li>4주차: trace 필드와 approval boundary를 최소 단위로 정의한다.</li>
</ol>

<p>핵심은 완벽한 체계를 한 번에 만들려 하지 않는 것이다. 작은 프로젝트일수록 더 큰 control plane을 문서 하나에 몰아넣기 쉽다. 오히려 작은 단위의 schema, check, trace부터 붙이는 편이 현실적이다.</p>

<h2 id="내가-처음-문제를-보던-방식과-지금의-차이">내가 처음 문제를 보던 방식과 지금의 차이</h2>

<p>처음에는 나 역시 AGENTS.md를 더 잘 쓰는 방향으로 문제를 해결하려 한 적이 있다. 좋은 규칙을 더 적고, 역할을 더 자세히 설명하고, handoff와 문서 규칙을 더 꼼꼼히 넣으면 안정성이 올라갈 것처럼 보였기 때문이다. 하지만 점점 분명해진 것은 문서를 더 추가하는 일이 아니라 운영 구조를 재설계하는 일이 핵심이라는 점이었다.</p>

<p>이 차이는 감상보다 설계의 문제에 가깝다. 문서를 잘 쓰는 일은 여전히 필요하지만, 하네스는 schema, enforcement, eval, trace, guardrail이 들어오면서 비로소 시스템이 된다.</p>

<h2 id="마무리">마무리</h2>

<p>이 시리즈가 말하고 싶었던 것은 결국 하나다. agent 시스템의 품질은 프롬프트 한 줄이나 문서 한 장으로 안정되지 않는다. 지침 파일은 가벼운 entrypoint가 되어야 하고, 세부 운영은 schema, eval, enforcement, trace, guardrail로 분산된 구조 안에서 검증 가능해져야 한다. 문서 중심 운영에서 관측 가능한 하네스로의 전환은 거대한 일괄 개편이 아니라, 반복적으로 놓치는 지점을 하나씩 바깥 레이어로 끌어내는 점진적 재구성에 더 가깝다.</p>

<p>독자에게 마지막으로 남기고 싶은 문장은 이것이다. 문서를 더 쓰는 것보다, 문서 밖에서 무엇을 자동으로 보이게 하고 검사하게 만들 것인가부터 설계하자.</p>

<h2 id="이-시리즈의-한-줄-요약">이 시리즈의 한 줄 요약</h2>

<p>하네스 엔지니어링의 핵심은 좋은 문서를 더 많이 쓰는 것이 아니라, 문서에 머물러 있던 운영 원칙을 schema, eval, enforcement, trace, guardrail로 옮겨 관측 가능하고 검증 가능한 시스템으로 재구성하는 일이다.</p>

<h2 id="마무리하며">마무리하며</h2>

<p>가장 먼저 할 일은 지침 파일을 더 길게 쓰는 것이 아니라, 지금 그 파일이 너무 많은 역할을 떠안고 있는지 보는 것이다. 그다음에는 반복적으로 놓치는 규칙 하나를 골라 schema나 check로 바꿔보는 편이 좋다. handoff 하나를 구조화하고, 문서 갱신 하나를 CI로 올리고, trace 필드 몇 개를 남기는 것만으로도 시스템은 눈에 띄게 달라질 수 있다. 작은 프로젝트라면 더더욱 한 번에 다 하려 하지 말고, 가장 자주 흔들리는 지점부터 바깥 레이어로 옮기면 된다. 이 시리즈가 끝나는 지점은 결론이라기보다 출발선에 가깝다. 이제 필요한 것은 더 많은 문서가 아니라, 더 잘 보이고 더 잘 검증되는 운영 구조다.</p>

<h2 id="출처-및-참고">출처 및 참고</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/guides/agents-md">Custom instructions with AGENTS.md</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/hooks">Hooks</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="roadmap" /><category term="eval" /><category term="trace" /><category term="guardrail" /><summary type="html"><![CDATA[문서 중심 운영에서 관측 가능한 하네스로 옮겨 가는 실전 전환 로드맵을 정리한 글.]]></summary></entry><entry xml:lang="en"><title type="html">Rust 11. File I/O and Command-Line Input</title><link href="https://www.k4nul.com/en/rust/file-io-and-command-line-input/" rel="alternate" type="text/html" title="Rust 11. File I/O and Command-Line Input" /><published>2026-04-18T09:00:00+09:00</published><updated>2026-04-18T09:00:00+09:00</updated><id>https://www.k4nul.com/en/rust/rust-file-io-and-cli-input-en</id><content type="html" xml:base="https://www.k4nul.com/en/rust/file-io-and-command-line-input/"><![CDATA[<h2 id="summary">Summary</h2>

<p>Once you move from Rust syntax exercises into real tools, you eventually need to read files, accept arguments, and handle errors. Understanding <code class="language-plaintext highlighter-rouge">std::env::args</code>, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code>, <code class="language-plaintext highlighter-rouge">std::fs::write</code>, and <code class="language-plaintext highlighter-rouge">Result</code> as one flow makes small CLI programs much easier to build.</p>

<p>This post focuses on the simplest pattern: accept a file path from the command line, read the file, process the text, and print or save a summary. The practical beginner approach is to read arguments in <code class="language-plaintext highlighter-rouge">main</code>, use the standard library for file access, keep the core calculation in a separate function, and propagate failures upward with <code class="language-plaintext highlighter-rouge">Result</code> and <code class="language-plaintext highlighter-rouge">?</code>.</p>

<h2 id="document-information">Document Information</h2>

<ul>
  <li>Written on: 2026-04-15</li>
  <li>Verification date: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Test version: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: only official documentation is used.</li>
  <li>Note: this post covers only the basic pattern for small UTF-8 text CLI tools and leaves out large-file streaming and advanced argument parsers.</li>
</ul>

<h2 id="problem-definition">Problem Definition</h2>

<p>It is hard to feel that you have really learned Rust if every example ends at syntax. Real beginner tools usually need a flow like this:</p>

<ul>
  <li>accept a path or option from the command line</li>
  <li>read a file</li>
  <li>process the contents</li>
  <li>return the result to the terminal or another file</li>
</ul>

<p>This post stays with the most basic text-file workflow. Large-file streaming, binary data, and advanced argument parsing libraries are intentionally left out.</p>

<h2 id="verified-facts">Verified Facts</h2>

<ul>
  <li>According to the standard library docs, <code class="language-plaintext highlighter-rouge">std::env::args</code> returns an iterator over the command-line arguments of the current process, and the first item is typically the program path.
Evidence: <a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li>According to the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code> reads a whole file into a <code class="language-plaintext highlighter-rouge">String</code>, and non-UTF-8 data can cause an error.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li>According to the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::write</code> writes a byte sequence to a file, creating it if needed and replacing the full contents if it already exists.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li>According to the official Rust Book, <code class="language-plaintext highlighter-rouge">Result</code> and the <code class="language-plaintext highlighter-rouge">?</code> operator form the basic pattern for propagating recoverable errors to the caller.
Evidence: <a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
</ul>

<p>One of the smallest useful examples is a CLI that counts lines and words in a text file.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::{</span><span class="n">env</span><span class="p">,</span> <span class="nn">error</span><span class="p">::</span><span class="n">Error</span><span class="p">,</span> <span class="n">fs</span><span class="p">,</span> <span class="n">io</span><span class="p">};</span>

<span class="k">fn</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">usize</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">lines</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.lines</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">words</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.split_whitespace</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">Box</span><span class="o">&lt;</span><span class="k">dyn</span> <span class="n">Error</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">path</span> <span class="o">=</span> <span class="nn">env</span><span class="p">::</span><span class="nf">args</span><span class="p">()</span><span class="nf">.nth</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="nf">.ok_or_else</span><span class="p">(||</span> <span class="p">{</span>
        <span class="nn">io</span><span class="p">::</span><span class="nn">Error</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">io</span><span class="p">::</span><span class="nn">ErrorKind</span><span class="p">::</span><span class="n">InvalidInput</span><span class="p">,</span> <span class="s">"usage: cargo run -- &lt;file-path&gt;"</span><span class="p">)</span>
    <span class="p">})</span><span class="o">?</span><span class="p">;</span>

    <span class="k">let</span> <span class="n">contents</span> <span class="o">=</span> <span class="nn">fs</span><span class="p">::</span><span class="nf">read_to_string</span><span class="p">(</span><span class="o">&amp;</span><span class="n">path</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
    <span class="k">let</span> <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span> <span class="o">=</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="o">&amp;</span><span class="n">contents</span><span class="p">);</span>

    <span class="k">let</span> <span class="n">summary</span> <span class="o">=</span> <span class="nd">format!</span><span class="p">(</span><span class="s">"file = {}</span><span class="se">\n</span><span class="s">lines = {}</span><span class="se">\n</span><span class="s">words = {}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">);</span>

    <span class="nd">println!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="n">summary</span><span class="p">);</span>
    <span class="nn">fs</span><span class="p">::</span><span class="nf">write</span><span class="p">(</span><span class="s">"summary.txt"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">summary</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

    <span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key reading order for that code is:</p>

<ol>
  <li>Use <code class="language-plaintext highlighter-rouge">env::args().nth(1)</code> to get the first real argument.</li>
  <li>If it does not exist, create an <code class="language-plaintext highlighter-rouge">io::Error</code> and return it through <code class="language-plaintext highlighter-rouge">Result</code>.</li>
  <li>Read the whole file with <code class="language-plaintext highlighter-rouge">fs::read_to_string</code>.</li>
  <li>Keep the core calculation in a small function like <code class="language-plaintext highlighter-rouge">count_lines_and_words</code>.</li>
  <li>Let <code class="language-plaintext highlighter-rouge">main</code> handle output and saving.</li>
</ol>

<p>That structure matters because it connects naturally to testing. Once the actual text-processing logic is separated from file I/O, it can be tested without creating files at all.</p>
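<p>A minimal sketch of that idea, reusing <code class="language-plaintext highlighter-rouge">count_lines_and_words</code> from the example above; the test strings are my own choices:</p>

```rust
// count_lines_and_words is repeated here so the sketch is self-contained;
// it is the same pure function as in the example above.
fn count_lines_and_words(text: &str) -> (usize, usize) {
    (text.lines().count(), text.split_whitespace().count())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn counts_a_small_file_worth_of_text() {
        let text = "Rust makes tools practical.\nRust makes testing easier.\n";
        assert_eq!(count_lines_and_words(text), (2, 8));
    }

    #[test]
    fn empty_input_counts_as_zero() {
        assert_eq!(count_lines_and_words(""), (0, 0));
    }
}

fn main() {
    // Quick check outside the test harness; `cargo test` runs the module above.
    assert_eq!(count_lines_and_words("a b\nc\n"), (2, 3));
}
```

<p>No temporary files, no cleanup: because the function takes <code class="language-plaintext highlighter-rouge">&amp;str</code> rather than a path, the tests stay fast and deterministic.</p>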

<h2 id="directly-confirmed-results">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions available in the current writing environment were:</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: when I placed the following <code class="language-plaintext highlighter-rouge">sample.txt</code> next to the example and ran the program, the result was:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rust makes tools practical.
Rust makes testing easier.
</code></pre></div></div>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">run</span><span class="w"> </span><span class="nt">--quiet</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="nx">sample.txt</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: the generated <code class="language-plaintext highlighter-rouge">summary.txt</code> contained:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Limitation of direct reproduction: I reproduced the flow with a sample file in a temporary Cargo project, but I did not validate larger files, non-UTF-8 input, or additional error paths.</li>
</ul>

<h2 id="interpretation--opinion">Interpretation / Opinion</h2>

<ul>
  <li>My view is that the first important CLI skill is not a powerful parser library, but clean input/output boundaries.</li>
  <li>Opinion: keep file reading, argument handling, and user-facing errors in <code class="language-plaintext highlighter-rouge">main</code>, while moving text processing into separate functions.</li>
  <li>Opinion: for small UTF-8 text files, <code class="language-plaintext highlighter-rouge">read_to_string</code> is the simplest place to start. Going too early into buffering and streaming can hide the larger structure beginners need to learn first.</li>
</ul>

<h2 id="limits-and-exceptions">Limits and Exceptions</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">std::env::args</code> has Unicode-related limitations, so <code class="language-plaintext highlighter-rouge">args_os</code> may be more appropriate when non-Unicode input matters.</li>
  <li><code class="language-plaintext highlighter-rouge">read_to_string</code> loads the whole file into memory, which may not fit very large files or binary data.</li>
  <li><code class="language-plaintext highlighter-rouge">write</code> replaces the existing contents, so append-style output needs a different API.</li>
  <li>Real CLI tools may need crates like <code class="language-plaintext highlighter-rouge">clap</code>, the standard-library <code class="language-plaintext highlighter-rouge">BufRead</code> trait, and more specific error types, but this post intentionally stays with the simplest standard-library pattern.</li>
</ul>
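<p>For the append case mentioned above, a minimal standard-library sketch looks like the following. The file name <code class="language-plaintext highlighter-rouge">log.txt</code> and the helper name are illustrative, not from the official docs.</p>

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Append one line to a file instead of replacing its contents like fs::write.
fn append_line(path: &str, line: &str) -> std::io::Result<()> {
    // `append(true)` keeps existing contents; `create(true)` makes the file if missing.
    let mut file = OpenOptions::new().append(true).create(true).open(path)?;
    writeln!(file, "{line}")
}

fn main() -> std::io::Result<()> {
    append_line("log.txt", "first run")?;
    append_line("log.txt", "second run")?;
    // After two calls, log.txt holds both lines instead of only the last one.
    println!("{}", std::fs::read_to_string("log.txt")?);
    Ok(())
}
```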

<h2 id="references">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html">Accepting Command Line Arguments</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="file-io" /><category term="cli" /><category term="std-fs" /><category term="env-args" /><summary type="html"><![CDATA[Rust guide to file and CLI input with std::fs::read_to_string, write, std::env::args, and Result.]]></summary></entry><entry xml:lang="ko"><title type="html">Rust 11. File I/O and Command-Line Input</title><link href="https://www.k4nul.com/rust/rust-file-io-and-cli-input/" rel="alternate" type="text/html" title="Rust 11. File I/O and Command-Line Input" /><published>2026-04-18T09:00:00+09:00</published><updated>2026-04-18T09:00:00+09:00</updated><id>https://www.k4nul.com/rust/rust-file-io-and-cli-input</id><content type="html" xml:base="https://www.k4nul.com/rust/rust-file-io-and-cli-input/"><![CDATA[<h2 id="요약">Summary</h2>

<p>Once you have learned basic Rust syntax and collections and start building real tools, you eventually have to read files, accept arguments, and handle errors. At that point, understanding <code class="language-plaintext highlighter-rouge">std::env::args</code>, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code>, <code class="language-plaintext highlighter-rouge">std::fs::write</code>, and <code class="language-plaintext highlighter-rouge">Result</code> as a single flow dramatically lowers the barrier to writing small CLI programs.</p>

<p>This post covers the most basic pattern: read a text file, take a command-line argument, and print a simple summary. The conclusion up front: at the beginner stage, the simplest and most maintainable flow is to take arguments in <code class="language-plaintext highlighter-rouge">main</code>, read files with the standard library, move the core calculation into a separate function, and propagate failures upward with <code class="language-plaintext highlighter-rouge">Result</code> and <code class="language-plaintext highlighter-rouge">?</code>.</p>

<h2 id="문서-정보">Document Information</h2>

<ul>
  <li>Written: 2026-04-15</li>
  <li>Verified as of: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Tested versions: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: official documentation only.</li>
  <li>Note: this post covers only the basic pattern for small UTF-8 text CLIs; large-file streaming and advanced argument parsers are out of scope.</li>
</ul>

<h2 id="문제-정의">Problem Definition</h2>

<p>Syntax explanations alone rarely make Rust feel learned. In practice, the sense of having built one tool end to end only arrives once the following flow is in place.</p>

<ul>
  <li>Take a file path or options from the command line.</li>
  <li>Read the file.</li>
  <li>Process the contents.</li>
  <li>Return the result to the screen or to another file.</li>
</ul>

<p>This post focuses on the most basic text-file-processing pattern within that flow. Streaming large files, binary files, and sophisticated argument-parsing libraries are out of scope.</p>

<h2 id="확인된-사실">Confirmed Facts</h2>

<ul>
  <li>Per the standard library docs, <code class="language-plaintext highlighter-rouge">std::env::args</code> returns the command-line arguments of the current process as an iterator, and the first value is usually the program path.
Evidence: <a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li>Per the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code> reads an entire file into a <code class="language-plaintext highlighter-rouge">String</code>, and non-UTF-8 data can produce an error.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li>Per the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::write</code> writes a byte sequence to a file, creating it if it does not exist and replacing the entire contents if it does.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li>Per the official book, <code class="language-plaintext highlighter-rouge">Result</code> and the <code class="language-plaintext highlighter-rouge">?</code> operator are the basic pattern for passing errors up to the caller.
Evidence: <a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
</ul>

<p>The smallest example is a CLI that counts lines and words in a file, as shown below.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::{</span><span class="n">env</span><span class="p">,</span> <span class="nn">error</span><span class="p">::</span><span class="n">Error</span><span class="p">,</span> <span class="n">fs</span><span class="p">,</span> <span class="n">io</span><span class="p">};</span>

<span class="k">fn</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">usize</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">lines</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.lines</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">words</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.split_whitespace</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">Box</span><span class="o">&lt;</span><span class="k">dyn</span> <span class="n">Error</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">path</span> <span class="o">=</span> <span class="nn">env</span><span class="p">::</span><span class="nf">args</span><span class="p">()</span><span class="nf">.nth</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="nf">.ok_or_else</span><span class="p">(||</span> <span class="p">{</span>
        <span class="nn">io</span><span class="p">::</span><span class="nn">Error</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">io</span><span class="p">::</span><span class="nn">ErrorKind</span><span class="p">::</span><span class="n">InvalidInput</span><span class="p">,</span> <span class="s">"usage: cargo run -- &lt;file-path&gt;"</span><span class="p">)</span>
    <span class="p">})</span><span class="o">?</span><span class="p">;</span>

    <span class="k">let</span> <span class="n">contents</span> <span class="o">=</span> <span class="nn">fs</span><span class="p">::</span><span class="nf">read_to_string</span><span class="p">(</span><span class="o">&amp;</span><span class="n">path</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
    <span class="k">let</span> <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span> <span class="o">=</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="o">&amp;</span><span class="n">contents</span><span class="p">);</span>

    <span class="k">let</span> <span class="n">summary</span> <span class="o">=</span> <span class="nd">format!</span><span class="p">(</span><span class="s">"file = {}</span><span class="se">\n</span><span class="s">lines = {}</span><span class="se">\n</span><span class="s">words = {}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">);</span>

    <span class="nd">println!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="n">summary</span><span class="p">);</span>
    <span class="nn">fs</span><span class="p">::</span><span class="nf">write</span><span class="p">(</span><span class="s">"summary.txt"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">summary</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

    <span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key order for reading this code is as follows.</p>

<ol>
  <li>Pull the first real argument with <code class="language-plaintext highlighter-rouge">env::args().nth(1)</code>.</li>
  <li>If the argument is missing, build an <code class="language-plaintext highlighter-rouge">io::Error</code> and return it through <code class="language-plaintext highlighter-rouge">Result</code>.</li>
  <li>Read the whole file with <code class="language-plaintext highlighter-rouge">fs::read_to_string</code>.</li>
  <li>Keep the core calculation in a small function like <code class="language-plaintext highlighter-rouge">count_lines_and_words</code>.</li>
  <li>Let <code class="language-plaintext highlighter-rouge">main</code> handle output and saving.</li>
</ol>

<p>That structure matters because it connects naturally to <code class="language-plaintext highlighter-rouge">testing</code> later. Once the actual calculation logic lives in a pure function, it can be tested without any file I/O.</p>
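<p>As an illustration, and using the same function shape as the example above (the layout is my own sketch, not from the post's project), even edge cases can be checked without touching the filesystem:</p>

```rust
// Same shape as the example's helper: pure text in, counts out.
fn count_lines_and_words(text: &str) -> (usize, usize) {
    (text.lines().count(), text.split_whitespace().count())
}

fn main() {
    // Edge cases that would be awkward to set up as real files:
    assert_eq!(count_lines_and_words(""), (0, 0)); // empty input
    assert_eq!(count_lines_and_words("  \n  \n"), (2, 0)); // whitespace-only lines
    assert_eq!(count_lines_and_words("one two\nthree"), (2, 3)); // no trailing newline
    println!("all edge cases pass");
}
```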

<h2 id="직접-확인한-결과">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions in the current writing environment were as follows.</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: with the following <code class="language-plaintext highlighter-rouge">sample.txt</code> in place, running the example from the body produced the result below.</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rust makes tools practical.
Rust makes testing easier.
</code></pre></div></div>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">run</span><span class="w"> </span><span class="nt">--quiet</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="nx">sample.txt</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: the <code class="language-plaintext highlighter-rouge">summary.txt</code> generated by the same run contained the following.</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Limits of direct verification: I reproduced the example flow with a representative input file in a temporary Cargo project, but I did not verify large files, non-UTF-8 input, or additional error cases.</li>
</ul>

<h2 id="해석--의견">Interpretation / Opinion</h2>

<ul>
  <li>My view is that the first thing to learn in a beginner CLI is not an advanced argument parser but a clean separation of the input/output boundary.</li>
  <li>Opinion: keep file reading, argument reading, and error output in <code class="language-plaintext highlighter-rouge">main</code>, and move text processing and calculation into separate functions; this makes testing and reuse easier.</li>
  <li>Opinion: at the stage of handling small UTF-8 text files, <code class="language-plaintext highlighter-rouge">read_to_string</code> is the simplest choice. Bringing in streaming and buffer optimization too early makes it easy to lose the structural picture.</li>
</ul>

<h2 id="한계와-예외">Limits and Exceptions</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">std::env::args</code> has limitations with non-Unicode arguments, so consider <code class="language-plaintext highlighter-rouge">args_os</code> when such input matters.</li>
  <li><code class="language-plaintext highlighter-rouge">read_to_string</code> loads the entire file into memory, so it may not suit very large files or binary files.</li>
  <li><code class="language-plaintext highlighter-rouge">write</code> overwrites existing file contents, so appending requires a different API.</li>
  <li>Real CLI tools may need libraries like <code class="language-plaintext highlighter-rouge">clap</code>, the <code class="language-plaintext highlighter-rouge">BufRead</code> trait, and more specific error types, but this post covers only the simplest standard-library pattern.</li>
</ul>
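<p>To illustrate the <code class="language-plaintext highlighter-rouge">args_os</code> point above, the argument-taking step can be written so it never requires valid Unicode. The helper name <code class="language-plaintext highlighter-rouge">first_path_arg</code> is mine, not from the docs.</p>

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

// Take the first real argument (index 1) as a path without requiring valid
// Unicode; `env::args` would panic on a non-Unicode argument, `args_os` won't.
fn first_path_arg<I: Iterator<Item = OsString>>(mut args: I) -> Option<PathBuf> {
    args.nth(1).map(PathBuf::from)
}

fn main() {
    match first_path_arg(env::args_os()) {
        Some(path) => println!("got path argument: {}", path.display()),
        None => eprintln!("usage: cargo run -- <file-path>"),
    }
}
```

<p>Because the helper takes any iterator of <code class="language-plaintext highlighter-rouge">OsString</code>, it can also be tested with a fabricated argument list instead of a real process invocation.</p>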

<h2 id="참고자료">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html">Accepting Command Line Arguments</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="file-io" /><category term="cli" /><category term="std-fs" /><category term="env-args" /><summary type="html"><![CDATA[A Rust guide to handling file and CLI input with std::fs::read_to_string, write, std::env::args, and Result.]]></summary></entry><entry xml:lang="en"><title type="html">Harness Engineering 09. Agent Systems Need Approval Boundaries and Guardrails</title><link href="https://www.k4nul.com/en/ai/approval-boundaries-and-guardrails/" rel="alternate" type="text/html" title="Harness Engineering 09. Agent Systems Need Approval Boundaries and Guardrails" /><published>2026-04-18T00:00:00+09:00</published><updated>2026-04-18T00:00:00+09:00</updated><id>https://www.k4nul.com/en/ai/approval-boundaries-and-guardrails-en</id><content type="html" xml:base="https://www.k4nul.com/en/ai/approval-boundaries-and-guardrails/"><![CDATA[<h2 id="summary">Summary</h2>

<p>If a system has many rules and still feels unsafe, the missing piece is often the boundary. A team may have role separation, verification rules, and good documentation principles, yet some actions still execute too easily and some inputs are trusted too quickly. That is why this post focuses on the difference between good operating documents and safe execution boundaries, and on why approval boundaries and guardrails need to be part of an agent system from the beginning.</p>

<h2 id="verification-scope-and-interpretation-boundary">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026, checked OpenAI Codex docs, OpenAI platform docs, and Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored engineering posts only when a concept is introduced there.</li>
  <li>Fact boundary: only documented features such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, evals, and trace are treated as factual claims.</li>
  <li>Interpretation boundary: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating abstractions used in this series unless a source line says otherwise.</li>
</ul>

<h2 id="what-is-an-approval-boundary">What Is an Approval Boundary?</h2>

<p>An approval boundary defines where an agent may act on its own and where human approval becomes mandatory. This is more than a permission toggle. It is a design decision about where responsibility changes hands.
Documented fact: OpenAI documents agent approvals as a distinct security layer, and Anthropic documents permissions as the layer that defines what actions are allowed(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)</p>

<p>The easiest examples are destructive actions. Deletion, deployment, and large-scale modification all have consequences that are hard to undo or broad in impact. A change that overwrites hundreds of files, deploys to production, or deletes existing data should usually require human approval before it proceeds. Writing “be careful” in documentation is not the same thing as making the system stop until a person approves the action.</p>
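<p>As a toy illustration of that difference (the <code class="language-plaintext highlighter-rouge">ActionKind</code> type and its variants are my own invention, not a vendor API), an approval boundary that actually stops is an explicit check in the execution path rather than a sentence in a document:</p>

```rust
// Hypothetical action classification for a harness; not an official schema.
#[derive(Debug)]
enum ActionKind {
    ReadFile,
    EditFile,
    Delete,
    Deploy,
}

// The approval boundary as code: destructive actions must escalate to a human.
fn requires_human_approval(action: &ActionKind) -> bool {
    matches!(action, ActionKind::Delete | ActionKind::Deploy)
}

fn main() {
    let actions = [
        ActionKind::ReadFile,
        ActionKind::EditFile,
        ActionKind::Delete,
        ActionKind::Deploy,
    ];
    for action in &actions {
        if requires_human_approval(action) {
            println!("{action:?}: blocked until a person approves");
        } else {
            println!("{action:?}: agent may proceed");
        }
    }
}
```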

<h2 id="what-do-guardrails-prevent">What Do Guardrails Prevent?</h2>

<p>What I mean here by guardrails is a broad practical label rather than a perfectly standardized vendor term. Official documentation often breaks this down into more specific mechanisms such as permissions, hooks, input or output guardrails, and trust verification. But the shared idea is still the same: mechanisms that stop an agent before it goes too far in the wrong direction or that limit risky flows. If approval boundaries define where escalation is required, guardrails define what kinds of actions or flows are not freely allowed in the first place.
Documented fact: OpenAI explains guardrails, policy checks, and execution limits in its safety and sandboxing docs, while Anthropic explains control points through hooks, sandboxing, and security guidance(<a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://developers.openai.com/codex/concepts/sandboxing">OpenAI sandboxing</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>, <a href="https://code.claude.com/docs/en/security">Claude security</a>)
Interpretation: in this post, <code class="language-plaintext highlighter-rouge">guardrail</code> is a practical umbrella term that groups those mechanisms together.</p>

<p>One clear example is preventing untrusted input from being treated as an executable instruction. External documents, issue bodies, web content, and data arriving through connectors or MCP can all be useful context. But they should not directly push agent behavior as if they were system instructions. A sentence found in an outside document should not immediately lead to deletion, deployment, or privilege expansion. External text can provide context, but it should not automatically inherit execution authority.</p>

<h2 id="where-does-this-matter-most">Where Does This Matter Most?</h2>

<p>Approval and guardrails matter especially in three areas.
Interpretation: the priority ordering in this section is my own operating judgment. The official docs still support the general direction by treating risky tool use, outside input, permission boundaries, and sandboxing as separate safety concerns(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>)</p>

<p>First, destructive action. Deletion, forceful overwrite, bulk editing, and deployment need clear approval boundaries. Second, external data and untrusted input. This is exactly why outside text should not be treated as direct execution guidance. It may be a factual claim or a suggestion, but it is not authority. Third, connector and MCP usage. In those cases, the system should at least record what was read, which tool was called, and how the result influenced the final decision.</p>

<p>For connector or MCP usage, a minimal log often helps: <code class="language-plaintext highlighter-rouge">source</code>, <code class="language-plaintext highlighter-rouge">tool_name</code>, <code class="language-plaintext highlighter-rouge">query_or_resource</code>, <code class="language-plaintext highlighter-rouge">timestamp</code>, and <code class="language-plaintext highlighter-rouge">decision_usage</code>. This is not a formal standard schema so much as a practical example of the minimum data that helps trace why external input led to a particular action. The important point is not merely that an external connector was used, but that its impact on the decision path remains visible.</p>
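<p>A minimal sketch of that record as a struct follows. The field names match the list above; the single-line formatting is my own choice, not a standard schema.</p>

```rust
// Hypothetical minimal connector/MCP usage record; not an official schema.
struct ConnectorLog {
    source: String,            // e.g. which connector or MCP server
    tool_name: String,         // which tool was called
    query_or_resource: String, // what was read or queried
    timestamp: String,         // when it happened (ISO 8601 string for simplicity)
    decision_usage: String,    // how the result influenced the final decision
}

impl ConnectorLog {
    // One line per event keeps the decision path easy to grep later.
    fn to_line(&self) -> String {
        format!(
            "{} source={} tool={} resource={} used_for={}",
            self.timestamp, self.source, self.tool_name, self.query_or_resource, self.decision_usage
        )
    }
}

fn main() {
    let entry = ConnectorLog {
        source: "mcp:docs".into(),
        tool_name: "search".into(),
        query_or_resource: "release notes".into(),
        timestamp: "2026-04-15T10:00:00+09:00".into(),
        decision_usage: "context-only".into(),
    };
    println!("{}", entry.to_line());
}
```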

<h2 id="what-i-came-to-see-as-more-important-later">What I Came to See as More Important, Later</h2>

<p>Early on, I spent more attention on role separation, verification, and documentation rules. Those questions were easier to see first: who should do what, which tests should exist, and which documents should be updated. Approval and guardrails felt more like optional additions that could be refined later.
Interpretation: this section describes how I came to rank approval boundaries and guardrails more highly. The factual anchor is that approvals, permissions, hooks, and sandboxing are all documented as separate operating layers(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/codex/hooks">OpenAI hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>)</p>

<p>Over time, it became much clearer that they were not optional at all. Even with good rules, risky actions can happen too easily if the approval boundary is empty. If guardrails are weak, outside input can end up carrying more weight than intended. A good operating document and a safe execution boundary are not the same thing.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>Any agent-first operating document needs approval boundaries and guardrails alongside it. You have to define which tool calls require human approval, where destructive actions must stop, how untrusted text and outside input should be handled, and what needs to be recorded when connectors or MCP are used. In harness engineering, the goal is not just to write many rules, but to make sure those rules operate inside safe execution boundaries.</p>

<p>The next post will turn this into a practical transition roadmap. The question there is how to move from document-centered operations toward an observable harness in a realistic sequence.</p>

<h2 id="one-sentence-summary">One-Sentence Summary</h2>

<p>Good operating rules alone do not make an agent system safe; approval boundaries and guardrails must be designed as part of the execution boundary from the start.</p>

<h2 id="preview-of-the-next-post">Preview of the Next Post</h2>

<p>The next post will look at a practical roadmap for moving from document-centered operations to an observable harness. The problems in this series are not isolated issues; in practice they form a transition path that has to be handled step by step. So the next piece will focus on sequencing: what should move out of documents first, what should be lifted into schema and eval, and what should be strengthened through trace and guardrails. The goal is not to create a perfect harness all at once, but to move toward one realistically. That is where the series shifts from concept explanation into practical adoption order.</p>

<h2 id="sources-and-references">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/concepts/sandboxing">Sandboxing</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/sandboxing">Sandboxing</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/security">Security</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="guardrail" /><category term="approval" /><category term="mcp" /><summary type="html"><![CDATA[Explains why agent systems become unstable when approval boundaries and guardrails are missing.]]></summary></entry><entry xml:lang="ko"><title type="html">Harness Engineering 09. An Agent System Is Unstable When approval and guardrail Are Empty</title><link href="https://www.k4nul.com/ai/approval-boundaries-and-guardrails/" rel="alternate" type="text/html" title="Harness Engineering 09. An Agent System Is Unstable When approval and guardrail Are Empty" /><published>2026-04-18T00:00:00+09:00</published><updated>2026-04-18T00:00:00+09:00</updated><id>https://www.k4nul.com/ai/approval-boundaries-and-guardrails</id><content type="html" xml:base="https://www.k4nul.com/ai/approval-boundaries-and-guardrails/"><![CDATA[<h2 id="요약">Summary</h2>

<p>If a system has many rules and still feels uneasy, the boundary is probably empty. A team may have role separation, verification rules, and well-written documentation principles, yet in actual operation some actions execute too easily and some inputs are trusted too quickly. So this post looks at why a good operating document and a safe execution boundary are different problems, and why an agent system needed approval boundaries and guardrails from the start.</p>

<h2 id="검증-기준과-해석-경계">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: 2026-04-15, checked the OpenAI Codex docs, the official OpenAI API docs, and the Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, with vendor-authored posts that introduced a concept used only as support.</li>
  <li>Fact boundary: only features verifiable in the docs, such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, eval, and trace, are stated as facts.</li>
  <li>Interpretation boundary: expressions like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating concepts used in this series, and are the author's interpretation where no separate evidence line is given.</li>
</ul>

<h2 id="approval-boundary란-무엇인가">What Is an approval boundary?</h2>

<p>An approval boundary is the line between actions the agent may take on its own and actions that require human approval. It is less a simple permission setting than a design decision that makes clear where responsibility changes hands.
Documented fact: OpenAI describes agent approvals as a distinct security layer, and Anthropic documents through permissions which actions are allowed(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)</p>

<p>The easiest example is destructive action. Work like deletion, deployment, and bulk modification is hard to undo or broad in impact. For instance, a job that changes hundreds of files at once, a production deployment, or a job that erases existing data is safer behind human approval than automatic execution, even when the agent understands the rules well. Writing “be careful” in a good document and actually making the system unable to proceed without approval are entirely different layers.</p>

<h2 id="guardrail은-무엇을-막는가">What Does a guardrail Prevent?</h2>

<p>The guardrail discussed here is closer to a broad practical label. Official docs sometimes describe it as more finely divided mechanisms such as permissions, hooks, input/output guardrails, and trust verification. What they share, though, is the role of stopping an agent before it goes too far in the wrong direction, or of limiting risky flows. If the approval boundary sets the approval point, the guardrail restricts actions and risky flows that are not allowed at all.
Documented fact: OpenAI explains guardrails, policy checks, and execution limits in its safety guide and sandboxing docs, and Anthropic explains pre- and post-execution control points in its hooks, sandboxing, and security docs(<a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://developers.openai.com/codex/concepts/sandboxing">OpenAI sandboxing</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>, <a href="https://code.claude.com/docs/en/security">Claude security</a>)
Interpretation: <code class="language-plaintext highlighter-rouge">guardrail</code> in this post is a practical umbrella term for these mechanisms.</p>

<p>For example, a guardrail is what keeps untrusted input from being treated directly as an execution instruction. External documents, issue bodies, text fetched from the web, and data arriving through connectors or MCP can all be useful context. But it is dangerous to let that content push agent behavior as if it were a system instruction. Something written in an external document should not lead straight to deletion, deployment, or privilege expansion. External text can be reference context, but it should not become input that immediately gains execution authority.</p>

<h2 id="어떤-영역에서-특히-중요해지는가">Where Does This Matter Most?</h2>

<p>Approval and guardrails become important in three areas in particular.
Interpretation: the judgment that certain areas matter most is the author's operating standard. That said, the official docs also treat high-risk tool use, outside input, permission boundaries, and sandboxing as separate safety topics(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>)</p>

<p>First, destructive action. Deletion, forceful overwrite, bulk change, and deployment need an approval boundary. Second, external data and untrusted input. This is why the contents of an external document must not be treated as execution instructions as-is. Text can be a factual claim, but it is not a delegation of authority. Third, external connections such as connectors and MCP. In that case, at minimum the system should record what the agent read, which tool it called, and which decision the result was used for.</p>

<p>For connector/MCP usage, for example, it helps to record fields like <code class="language-plaintext highlighter-rouge">source</code>, <code class="language-plaintext highlighter-rouge">tool_name</code>, <code class="language-plaintext highlighter-rouge">query_or_resource</code>, <code class="language-plaintext highlighter-rouge">timestamp</code>, and <code class="language-plaintext highlighter-rouge">decision_usage</code>. This is less an official standard schema than a minimal practical example for later tracing why a piece of external data led to a given action. What matters more than the fact that an external connection was used is that its influence on the decision process stays visible.</p>

<h2 id="내가-뒤늦게-더-중요하게-보게-된-항목">What I Came to Weight More Heavily, Belatedly</h2>

<p>When I first designed agent operations, I too focused more on role separation, verification, and documentation rules. What to delegate and how, which tests to attach, which documents to update: those were easier to see. Approvals and guardrails looked like options to bolt on afterward.
Interpretation: this section records how I came to treat approval boundaries and guardrails as a more central axis. The documented fact is that approvals, permissions, hooks, and sandboxing all exist as separate operational layers(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/codex/hooks">OpenAI hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>)</p>

<p>Looking back, though, this was not decoration to bolt on later; it was something that has to be designed in from the start. Even with good rules, an empty approval boundary lets dangerous actions execute far too easily, and weak guardrails give external input more power than expected. A good operations document and a safe execution boundary turned out not to be the same thing.</p>

<h2 id="마무리">Closing</h2>

<p>An agent-first operations document needs approval boundaries and guardrails. From the start, you have to decide which tool calls require human approval, where destructive actions must stop, how external text and untrusted input are handled, and what must be recorded when connectors/MCP are used. What matters in harness engineering is not writing lots of rules, but making those rules operate inside a safe execution boundary.</p>

<p>The next post carries this thread into an actual migration roadmap: what to change first, in practical order, when moving from document-centered operations to an observable harness.</p>

<h2 id="이-글의-한-줄-요약">One-Line Summary of This Post</h2>

<p>Good operating rules alone cannot make an agent system safe; execution boundaries such as approval boundaries and guardrails must be designed in from the start.</p>

<h2 id="다음-글-예고">Next Post Preview</h2>

<p>The next post will cover “From document-centered operations to an observable harness: a practical migration roadmap.” The problems this series has examined are not isolated from one another; in practice they form a migration task that has to be worked through a step at a time. So the next post will set the order: what to separate out of documents first, what to lift into schemas and evals, and what to reinforce with traces and guardrails. Rather than building a perfect harness up front, what matters more is the realistic sequence of stages. From here on, we move past concept explanation into the order of actual adoption.</p>

<h2 id="출처-및-참고">Sources and References</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/concepts/sandboxing">Sandboxing</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/sandboxing">Sandboxing</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/security">Security</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="guardrail" /><category term="approval" /><category term="mcp" /><summary type="html"><![CDATA[Why an agent system wobbles when the approval boundary and guardrails are left empty.]]></summary></entry><entry xml:lang="en"><title type="html">Rust 10. Testing in Rust</title><link href="https://www.k4nul.com/en/rust/testing-in-rust/" rel="alternate" type="text/html" title="Rust 10. Testing in Rust" /><published>2026-04-17T09:00:00+09:00</published><updated>2026-04-17T09:00:00+09:00</updated><id>https://www.k4nul.com/en/rust/rust-testing-basics-en</id><content type="html" xml:base="https://www.k4nul.com/en/rust/testing-in-rust/"><![CDATA[<h2 id="summary">Summary</h2>

<p>In Rust, getting code to run once is not the same as being able to trust future changes. As soon as functions grow and files start splitting apart, <code class="language-plaintext highlighter-rouge">cargo test</code> becomes one of the most practical tools for staying confident during refactoring.</p>

<p>This post introduces <code class="language-plaintext highlighter-rouge">cargo test</code>, <code class="language-plaintext highlighter-rouge">#[test]</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code>, integration tests, and tests that return <code class="language-plaintext highlighter-rouge">Result</code>. The practical conclusion is that Rust becomes much easier to test when input and output stay thin, core logic is kept in small functions, and reusable logic lives in <code class="language-plaintext highlighter-rouge">lib.rs</code>.</p>

<h2 id="document-information">Document Information</h2>

<ul>
  <li>Written on: 2026-04-15</li>
  <li>Verification date: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Test version: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: only official documentation is used.</li>
  <li>Note: this post focuses on the beginner testing workflow and intentionally leaves broader topics such as async testing and property-based testing for separate material.</li>
</ul>

<h2 id="problem-definition">Problem Definition</h2>

<p>At the beginner stage, it is easy to think that a successful run means the program is done. But once code starts changing, several problems appear immediately.</p>

<ul>
  <li>Refactoring becomes risky because you have to check behavior manually.</li>
  <li>Edge cases need to be rerun by hand over and over.</li>
  <li>If all logic stays in <code class="language-plaintext highlighter-rouge">main.rs</code>, it is hard to verify just one behavior in isolation.</li>
</ul>

<p>The goal of this post is to treat testing not as a heavy process, but as the default way to keep small Rust programs safe to change.</p>

<h2 id="verified-facts">Verified Facts</h2>

<ul>
  <li>According to the official Cargo docs, <code class="language-plaintext highlighter-rouge">cargo test</code> builds and runs the tests for a project.
Evidence: <a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li>According to the official Rust Book, the <code class="language-plaintext highlighter-rouge">#[test]</code> attribute marks a test function, and macros like <code class="language-plaintext highlighter-rouge">assert!</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, and <code class="language-plaintext highlighter-rouge">assert_ne!</code> are used to verify expectations.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li>According to the official docs, unit tests are commonly placed in a <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code> module in the same file, while integration tests go in the <code class="language-plaintext highlighter-rouge">tests/</code> directory.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
  <li>According to the official docs, tests can also be written to return <code class="language-plaintext highlighter-rouge">Result&lt;(), E&gt;</code>.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
</ul>

<p>One of the most practical beginner layouts is this:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rust-testing-basics/
  Cargo.toml
  src/
    lib.rs
  tests/
    normalize.rs
</code></pre></div></div>

<p>Put the function you want to verify in <code class="language-plaintext highlighter-rouge">src/lib.rs</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="n">input</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
    <span class="n">input</span><span class="nf">.trim</span><span class="p">()</span><span class="nf">.to_lowercase</span><span class="p">()</span>
<span class="p">}</span>

<span class="nd">#[cfg(test)]</span>
<span class="k">mod</span> <span class="n">tests</span> <span class="p">{</span>
    <span class="k">use</span> <span class="k">super</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">trims_and_lowercases</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Alice "</span><span class="p">),</span> <span class="s">"alice"</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">keeps_empty_string_after_trim</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"   "</span><span class="p">),</span> <span class="s">""</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">result_based_test</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">String</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">actual</span> <span class="o">=</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="s">"Rust"</span><span class="p">);</span>

        <span class="k">if</span> <span class="n">actual</span> <span class="o">==</span> <span class="s">"rust"</span> <span class="p">{</span>
            <span class="nf">Ok</span><span class="p">(())</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="nf">Err</span><span class="p">(</span><span class="nd">format!</span><span class="p">(</span><span class="s">"unexpected value: {}"</span><span class="p">,</span> <span class="n">actual</span><span class="p">))</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then place an integration test in <code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> to verify the public API from the outside.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">rust_testing_basics</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

<span class="nd">#[test]</span>
<span class="k">fn</span> <span class="nf">integration_test_uses_public_api</span><span class="p">()</span> <span class="p">{</span>
    <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Bob  "</span><span class="p">),</span> <span class="s">"bob"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>One practical detail is that the package name <code class="language-plaintext highlighter-rouge">rust-testing-basics</code> is accessed in code as <code class="language-plaintext highlighter-rouge">rust_testing_basics</code>. This structure also prepares you well for later posts, because once file I/O or CLI logic is added, you can still keep the core logic testable inside <code class="language-plaintext highlighter-rouge">lib.rs</code>.</p>

<h2 id="directly-confirmed-results">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions available in the current writing environment were:</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: when I ran <code class="language-plaintext highlighter-rouge">cargo test --quiet</code> against the same <code class="language-plaintext highlighter-rouge">src/lib.rs</code> and <code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> structure as the post, the key output was:</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">test</span><span class="w"> </span><span class="nt">--quiet</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>running 3 tests
...
test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 1 test
.
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
</code></pre></div></div>

<ul>
  <li>Limitation of direct reproduction: I reproduced the test run in a temporary Cargo project, but I did not add a separate example test project to this repository.</li>
</ul>

<h2 id="interpretation--opinion">Interpretation / Opinion</h2>

<ul>
  <li>My view is that the most important beginner testing skill is not test syntax, but function boundaries.</li>
  <li>Opinion: if <code class="language-plaintext highlighter-rouge">main.rs</code> stays thin and the core transformation logic moves into <code class="language-plaintext highlighter-rouge">lib.rs</code>, testing becomes much easier.</li>
  <li>Opinion: unit tests are good for small rules, while integration tests are good for checking whether the public API is wired correctly. Beginners benefit from seeing both at least once.</li>
</ul>

<h2 id="limits-and-exceptions">Limits and Exceptions</h2>

<ul>
  <li>This post covers only the testing basics and does not include benchmarking, property-based testing, or snapshot testing.</li>
  <li>Async tests, database tests, and network tests may require extra runtime or environment setup, so they are outside the scope here.</li>
  <li>Options for controlling parallel execution, filtering tests, or handling output are documented in more detail in the <code class="language-plaintext highlighter-rouge">cargo test</code> docs.</li>
  <li>The exact way you design failure messages and fixtures can vary by team or project.</li>
</ul>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-00-testing.html">Writing Automated Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-02-running-tests.html">Controlling How Tests Are Run</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="testing" /><category term="cargo-test" /><category term="unit-test" /><category term="integration-test" /><summary type="html"><![CDATA[Beginner-friendly Rust testing guide to cargo test, unit tests, integration tests, assert_eq!, and Result-based tests.]]></summary></entry><entry xml:lang="ko"><title type="html">Rust 10. Testing in Rust</title><link href="https://www.k4nul.com/rust/rust-testing-basics/" rel="alternate" type="text/html" title="Rust 10. Testing in Rust" /><published>2026-04-17T09:00:00+09:00</published><updated>2026-04-17T09:00:00+09:00</updated><id>https://www.k4nul.com/rust/rust-testing-basics</id><content type="html" xml:base="https://www.k4nul.com/rust/rust-testing-basics/"><![CDATA[<h2 id="요약">Summary</h2>

<p>In Rust, running code and verifying code are different steps. The fact that an example ran once does not mean later edits will keep producing the same result, so it is much safer to pick up <code class="language-plaintext highlighter-rouge">cargo test</code> and the basic test structure early.</p>

<p>This post walks through <code class="language-plaintext highlighter-rouge">cargo test</code>, <code class="language-plaintext highlighter-rouge">#[test]</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code>, integration tests, and tests that return <code class="language-plaintext highlighter-rouge">Result</code>, at a beginner level. The short conclusion: to write Rust code that is easy to test, first adopt the principle of “thin I/O, core logic in small functions, reusable functions in <code class="language-plaintext highlighter-rouge">lib.rs</code>.”</p>

<h2 id="문서-정보">Document Information</h2>

<ul>
  <li>Written on: 2026-04-15</li>
  <li>Verification date: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Test versions: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: only official documentation is used.</li>
  <li>Note: this post focuses on the beginner testing flow and leaves extended topics such as async testing and property-based testing to separate material.</li>
</ul>

<h2 id="문제-정의">Problem Definition</h2>

<p>At the beginner stage, it is easy to feel that the program is “done” once it runs. But as functions multiply and files start splitting apart, the following problems appear immediately.</p>

<ul>
  <li>After a refactor, you can only check by eye whether old behavior broke.</li>
  <li>Edge cases have to be rerun manually, again and again.</li>
  <li>With all the logic crowded into <code class="language-plaintext highlighter-rouge">main.rs</code>, it is hard to verify one behavior in isolation.</li>
</ul>

<p>The goal of this post is to see testing not as a grand quality process, but as “the basic tool for continuing to change small Rust functions safely.”</p>

<h2 id="확인된-사실">Verified Facts</h2>

<ul>
  <li>According to the official Cargo docs, <code class="language-plaintext highlighter-rouge">cargo test</code> builds and runs a project’s tests.
Evidence: <a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li>According to the official docs, the <code class="language-plaintext highlighter-rouge">#[test]</code> attribute marks a function as a test, and macros such as <code class="language-plaintext highlighter-rouge">assert!</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, and <code class="language-plaintext highlighter-rouge">assert_ne!</code> verify expected results.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li>According to the official docs, unit tests usually live in a <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code> module in the same file, while integration tests go in the <code class="language-plaintext highlighter-rouge">tests/</code> directory.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
  <li>According to the official docs, a test function can also be written to return <code class="language-plaintext highlighter-rouge">Result&lt;(), E&gt;</code>.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
</ul>

<p>For beginners, the most practical structure looks like this.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rust-testing-basics/
  Cargo.toml
  src/
    lib.rs
  tests/
    normalize.rs
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">src/lib.rs</code> holds the pure function you want to verify.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="n">input</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
    <span class="n">input</span><span class="nf">.trim</span><span class="p">()</span><span class="nf">.to_lowercase</span><span class="p">()</span>
<span class="p">}</span>

<span class="nd">#[cfg(test)]</span>
<span class="k">mod</span> <span class="n">tests</span> <span class="p">{</span>
    <span class="k">use</span> <span class="k">super</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">trims_and_lowercases</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Alice "</span><span class="p">),</span> <span class="s">"alice"</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">keeps_empty_string_after_trim</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"   "</span><span class="p">),</span> <span class="s">""</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">result_based_test</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">String</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">actual</span> <span class="o">=</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="s">"Rust"</span><span class="p">);</span>

        <span class="k">if</span> <span class="n">actual</span> <span class="o">==</span> <span class="s">"rust"</span> <span class="p">{</span>
            <span class="nf">Ok</span><span class="p">(())</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="nf">Err</span><span class="p">(</span><span class="nd">format!</span><span class="p">(</span><span class="s">"unexpected value: {}"</span><span class="p">,</span> <span class="n">actual</span><span class="p">))</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> can hold an integration test written from an external user’s point of view.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">rust_testing_basics</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

<span class="nd">#[test]</span>
<span class="k">fn</span> <span class="nf">integration_test_uses_public_api</span><span class="p">()</span> <span class="p">{</span>
    <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Bob  "</span><span class="p">),</span> <span class="s">"bob"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here the package name <code class="language-plaintext highlighter-rouge">rust-testing-basics</code> is accessed in code through the underscore path <code class="language-plaintext highlighter-rouge">rust_testing_basics</code>. This structure has the advantage that even when file I/O or a CLI is added later, the core logic can stay in <code class="language-plaintext highlighter-rouge">lib.rs</code> and remain easy to test.</p>

<h2 id="직접-확인한-결과">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions in the current writing environment were as follows.</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: running <code class="language-plaintext highlighter-rouge">cargo test --quiet</code> with the same <code class="language-plaintext highlighter-rouge">src/lib.rs</code> and <code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> structure as in the post produced the following key output.</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">test</span><span class="w"> </span><span class="nt">--quiet</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>running 3 tests
...
test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 1 test
.
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
</code></pre></div></div>

<ul>
  <li>Limitation of direct confirmation: I built a temporary test project from the structure in this post and ran <code class="language-plaintext highlighter-rouge">cargo test</code>, but did not add a separate example test project to this repository.</li>
</ul>

<h2 id="해석--의견">Interpretation / Opinion</h2>

<ul>
  <li>In my judgment, the most important part of getting started with Rust testing is not test syntax but function boundaries.</li>
  <li>Opinion: keeping only I/O and branching in <code class="language-plaintext highlighter-rouge">main.rs</code> and moving core computation and transformation logic into <code class="language-plaintext highlighter-rouge">lib.rs</code> lowers the difficulty of testing considerably.</li>
  <li>Opinion: unit tests catch small rules quickly, while integration tests suit checking that the public API is wired as expected. At the beginner stage it is worth experiencing both at least once.</li>
</ul>

<h2 id="한계와-예외">Limits and Exceptions</h2>

<ul>
  <li>This post covers only the basic testing flow and does not include benchmarks, property-based testing, or snapshot testing.</li>
  <li>Testing async functions, databases, and networks may require an additional runtime or external environment, so they are excluded from this scope.</li>
  <li>Options for controlling parallel execution, capturing output, and running only specific tests are covered in more detail in the <code class="language-plaintext highlighter-rouge">cargo test</code> docs.</li>
  <li>How failure messages and fixtures are designed can vary by team convention.</li>
</ul>

<h2 id="참고자료">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-00-testing.html">Writing Automated Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-02-running-tests.html">Controlling How Tests Are Run</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="testing" /><category term="cargo-test" /><category term="unit-test" /><category term="integration-test" /><summary type="html"><![CDATA[A beginner’s guide to Rust testing covering cargo test, unit tests, integration tests, assert_eq!, and Result-returning tests.]]></summary></entry><entry xml:lang="en"><title type="html">Harness Engineering 08. Results Alone Are Not Enough; You Need Trace</title><link href="https://www.k4nul.com/en/ai/why-trace-matters-more-than-the-result/" rel="alternate" type="text/html" title="Harness Engineering 08. Results Alone Are Not Enough; You Need Trace" /><published>2026-04-17T00:00:00+09:00</published><updated>2026-04-17T00:00:00+09:00</updated><id>https://www.k4nul.com/en/ai/why-trace-matters-more-than-results-en</id><content type="html" xml:base="https://www.k4nul.com/en/ai/why-trace-matters-more-than-the-result/"><![CDATA[<h2 id="summary">Summary</h2>

<p>The result is preserved, but no one knows why the system acted that way. In harness engineering, that situation becomes a real problem surprisingly often. The code output is there and the execution log exists, but the reasoning is missing: why this team was chosen as the owner, why this tool was selected, why the task was delegated to a sub-agent, or where a handoff actually failed. That is why this post focuses on the difference between result logs and the decision-relevant layer inside a trace, and why trace is what makes real improvement possible.</p>

<h2 id="verification-scope-and-interpretation-boundary">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026, checked OpenAI Codex docs, OpenAI platform docs, and Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored engineering posts only when a concept is introduced there.</li>
  <li>Fact boundary: only documented features such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, evals, and trace are treated as factual claims.</li>
  <li>Interpretation boundary: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating abstractions used in this series unless a source line says otherwise.</li>
</ul>

<h2 id="result-logs-and-the-decision-relevant-layer-of-trace">Result Logs and the Decision-Relevant Layer of Trace</h2>

<p>Result logs usually record what happened. They show whether tests passed, which files changed, which commands were run, and what the final artifact looked like. That is useful and necessary.
Documented fact: OpenAI’s Agents SDK guide explicitly talks about keeping a “full trace of what happened,” and the trace grading guide treats traces as evaluation inputs(<a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>)</p>

<p>What I mean here by the decision-relevant layer of trace is not a formal vendor term, but a descriptive way to point at the parts of a trace that preserve decision context: why the platform team was chosen instead of the auth team, why a certain analysis tool was used instead of a simpler one, why a sub-agent was introduced, or why a handoff happened at that specific point. Storing the result and preserving that decision context may look similar, but they serve different purposes.</p>

<h2 id="what-goes-wrong-when-trace-is-missing">What Goes Wrong When Trace Is Missing</h2>

<p>When trace is weak, teams are forced to guess from the output alone. Imagine a case where the final code exists, but the work was routed to the wrong owner. Looking at the result may tell you who ended up touching the code, but not why that routing decision happened. That means the next fix often becomes guesswork rather than a real improvement to the harness.
Interpretation: this section explains the operational problem of reconstructing failures from outputs alone. The factual anchor is that trace is meant to carry more than the final artifact, including intermediate decisions and tool use(<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>)</p>

<p>Tool choice works the same way. An execution log may show which tool was used, but not why it was chosen. Without that reason, the same poor choice can repeat while the harness team still lacks a clear place to intervene. Handoff failure is similar. The result may show that information was missing, but only trace can show which step failed to pass it along.</p>

<h2 id="what-kinds-of-reasoning-should-be-preserved">What Kinds of Reasoning Should Be Preserved?</h2>

<p>This does not mean recording every thought in exhaustive detail. It means preserving the minimum reasoning needed for improvement. At a minimum, it helps to capture why an owner was selected, why a tool was chosen, whether a sub-agent was used and why, when the handoff happened, and whether scope was narrowed or expanded.
Interpretation: the specific minimum fields listed here are my recommendation. The official direction behind them is that trace grading evaluates reasoning process and tool path, not just the final answer (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/subagents">Subagents</a>).</p>
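<p>As one way to make these minimum fields concrete, here is a sketch of a decision-trace record. The <code class="language-plaintext highlighter-rouge">DecisionRecord</code> structure and its field names are my own assumptions, not a vendor schema:</p>

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class DecisionRecord:
    """One decision-relevant trace entry (hypothetical schema)."""
    step: str                              # e.g. "routing", "tool-choice", "handoff"
    owner: Optional[str] = None            # who the work was routed to
    owner_reason: Optional[str] = None     # why that owner, not another
    tool: Optional[str] = None             # which tool was invoked
    tool_reason: Optional[str] = None      # why this tool over a simpler one
    subagent_reason: Optional[str] = None  # why a sub-agent was introduced, if any
    scope_change: Optional[str] = None     # "narrowed" / "expanded", plus why

    def to_json(self) -> str:
        # Stored alongside execution logs, one JSON line per decision.
        return json.dumps(asdict(self), ensure_ascii=False)

record = DecisionRecord(
    step="routing",
    owner="platform",
    owner_reason="change spans infra config, not auth logic",
)
print(record.to_json())
```

<p>The point is not this exact shape, but that every record carries a reason field next to the decision it explains.</p>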

<p>With that kind of trace, teams can later see decisions like “this change touches both <code class="language-plaintext highlighter-rouge">payments/</code> and <code class="language-plaintext highlighter-rouge">docs/payments.md</code>, so the payments owner should handle it” or “a sub-agent was used not for parallelism, but because a distinct domain context was required.” At that point, trace stops being a debugging note and starts becoming harness-improvement data.</p>

<h2 id="how-trace-connects-to-eval">How Trace Connects to Eval</h2>

<p>Trace is the raw material for eval. If a wrong-owner routing eval fails, trace helps distinguish whether the issue came from a bad ownership decision, stale ownership mapping, or a missing routing rule. Tool-choice eval works the same way. Looking only at the output makes it hard to tell the difference between a lucky success and a well-justified success. Trace makes that difference visible.
Documented fact: OpenAI officially documents trace grading as an evaluation technique (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>).</p>
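<p>To make the lucky-versus-justified distinction concrete, here is a minimal grader sketch over a list of hypothetical decision records. The record keys and the grading rule are my assumptions, not the official trace-grading API:</p>

```python
def grade_routing(trace: list[dict]) -> dict:
    """Pass only if the routing decision carries a recorded reason.

    `trace` is a list of decision records; the "step", "owner", and
    "owner_reason" keys are assumed field names, not a vendor schema.
    """
    routing = [r for r in trace if r.get("step") == "routing"]
    if not routing:
        return {"passed": False, "why": "no routing decision recorded"}
    decision = routing[0]
    if not decision.get("owner_reason"):
        # The final output may still be correct, but without a reason
        # we cannot tell a lucky pick from a well-justified one.
        return {"passed": False, "why": "owner chosen without a reason"}
    return {"passed": True,
            "why": f"owner={decision['owner']}: {decision['owner_reason']}"}

good = [{"step": "routing", "owner": "payments",
         "owner_reason": "touches payments/ and docs/payments.md"}]
lucky = [{"step": "routing", "owner": "payments"}]
print(grade_routing(good)["passed"], grade_routing(lucky)["passed"])  # True False
```

<p>Both runs end with the same owner, so an output-only check passes both; only the trace-based check separates them.</p>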

<p>Imagine a task that ended successfully, but the trace shows that a sub-agent was unnecessary, handoff leaked information twice, and documentation updates were delayed. The product result may still be acceptable, but the orchestration quality was weak. Trace is what reveals that gap.</p>

<h2 id="the-lack-of-trace-i-felt-directly">The Lack of Trace I Felt Directly</h2>

<p>For a while, I also assumed that execution results and logs would be enough. If I knew what ran and what the final result was, it seemed like I should be able to reconstruct what mattered. Later, it became clear that the more important question was whether the reasoning behind the decisions had been preserved at all. Without that, harness improvement kept collapsing back into guesswork from the outcome.</p>

<p>The real lesson was not just that trace is convenient. It was that trace is what makes structural improvement possible.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>Storing outputs and execution logs still matters. But that alone is not enough to understand an agent system well. Harness engineering includes not only the result, but also the observability of the reasoning process. That is why trace becomes a core asset for debugging, evaluation, and improvement.</p>

<p>The next post will connect this to approval and guardrails. Once you can observe the decision path, the next question is which kinds of decisions should be constrained or escalated before they happen.</p>

<h2 id="one-sentence-summary">One-Sentence Summary</h2>

<p>Outputs and execution logs are not enough to improve a harness; you also need trace that preserves why the system made the choices it made.</p>

<h2 id="preview-of-the-next-post">Preview of the Next Post</h2>

<p>The next post will explore why agent systems become unstable when approval and guardrails are missing. Some decisions are not enough to observe after the fact; they need to be blocked, limited, or escalated up front. That is especially true for risky tool use, edits outside intended boundaries, and actions that require stronger permissions. So the next step is to look at how approval and guardrails contribute to stability inside the harness. If trace creates observability, guardrails define the allowed space in which those decisions can happen.</p>

<h2 id="sources-and-references">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/subagents">Subagents</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="codex" /><category term="claude-code" /><category term="harness-engineering" /><category term="trace" /><summary type="html"><![CDATA[Explains why results alone are not enough and why trace matters for operations, debugging, and evaluation.]]></summary></entry><entry xml:lang="ko"><title type="html">Harness Engineering 08. Results Alone Are Not Enough; Trace Is Needed</title><link href="https://www.k4nul.com/ai/why-trace-matters-more-than-results/" rel="alternate" type="text/html" title="Harness Engineering 08. Results Alone Are Not Enough; Trace Is Needed" /><published>2026-04-17T00:00:00+09:00</published><updated>2026-04-17T00:00:00+09:00</updated><id>https://www.k4nul.com/ai/why-trace-matters-more-than-results</id><content type="html" xml:base="https://www.k4nul.com/ai/why-trace-matters-more-than-results/"><![CDATA[<h2 id="요약">Summary</h2>

<p>The result is there, but no one knows why it turned out that way. In harness engineering, this situation comes up surprisingly often: the code artifact exists and the execution log exists, yet there is no visibility into why a particular team was the owner, why a particular tool was chosen, why the work was handed to a sub-agent, or at which point the handoff failed. So this post looks at why the decision rationale inside a trace differs from a result log, and why improvement only becomes possible once trace exists.</p>

<h2 id="검증-기준과-해석-경계">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026, checked OpenAI Codex docs, the official OpenAI API docs, and Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored posts that first introduced a concept as supporting material only.</li>
  <li>Factual scope: only features verifiable in the docs, such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, eval, and trace, are stated as fact.</li>
  <li>Interpretive scope: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operational concepts used in this series; where no evidence line is given, they are my interpretation.</li>
</ul>

<h2 id="결과-로그와-trace-안의-판단-근거의-차이">The Difference Between a Result Log and Decision Rationale in Trace</h2>

<p>A result log usually records what happened: whether tests passed, whether files were modified, which commands ran, and what the final artifact was. That alone covers the basic record keeping that operations needs.
Documented fact: OpenAI's Agents SDK guide describes a "full trace of what happened," and the trace grading guide treats trace as an evaluation input (<a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>).</p>

<p>The decision-relevant trace discussed here is not an official standard term; it is a descriptive label for the decision context inside a trace, such as owner selection, tool selection, and handoff timing. Why the platform team was chosen as owner instead of the auth team, why a specific analysis tool was used instead of grep, why the work was handed to a sub-agent rather than finished by a single agent, and why the handoff was made at that point all fall into this category. Storing the artifact and tracing this decision rationale may look similar, but their purposes are entirely different.</p>

<h2 id="trace가-없을-때-생기는-운영-문제">Operational Problems When Trace Is Missing</h2>

<p>When trace is weak, failure causes have to be inferred from results alone. Suppose the artifact is there, but an auth-related change was routed to the wrong team. The code shows who ended up owning the work, but not why that call was made. The next improvement then leans on human guesswork instead of refining the rules.
Interpretation: this section describes the operational problem that weak trace keeps root-cause analysis at the level of guessing from results. The documented fact is that trace feeds evaluation with intermediate decisions and tool use, not just output (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>).</p>

<p>Tool choice works the same way. The execution log may record which tool was used, but the reason for choosing it is easy to lose. Without that reason, a poor choice can repeat while the team has little sense of how to fix the harness. Handoff failure is similar: the result may show that information was missing, but only trace can break down which step failed to pass what, and by whom.</p>

<h2 id="어떤-판단-근거를-남겨야-하는가">What Decision Rationale Should Be Preserved?</h2>

<p>This does not mean recording every thought at length. It means preserving the minimum rationale needed for improvement. At a minimum, the reason an owner was selected, the reason a tool was chosen, whether and why work was delegated to a sub-agent, the handoff timing and any omissions, and decisions to narrow or expand scope should be traceable.
Interpretation: fields like the owner-choice reason, the tool-choice reason, and sub-agent delegation are the minimum trace fields I am proposing. That said, OpenAI's trace grading docs officially point toward evaluating the reasoning process and tool path carried in trace (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/subagents">Subagents</a>).</p>
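<p>One lightweight way to enforce these minimum fields is to validate each decision record before it is stored. The field names below are my proposal, not an official schema:</p>

```python
# Minimum reason field this series proposes for each kind of decision
# (hypothetical names, not a vendor schema).
REQUIRED_REASONS = {
    "routing": "owner_reason",
    "tool-choice": "tool_reason",
    "delegation": "subagent_reason",
}

def missing_reasons(trace: list[dict]) -> list[str]:
    """Describe every decision that was stored without its reason."""
    problems = []
    for i, record in enumerate(trace):
        required = REQUIRED_REASONS.get(record.get("step", ""))
        if required and not record.get(required):
            problems.append(f"record {i}: {record['step']} lacks {required}")
    return problems

trace = [
    {"step": "routing", "owner": "payments",
     "owner_reason": "touches payments/ and docs/payments.md"},
    {"step": "tool-choice", "tool": "analyzer"},  # reason omitted
]
print(missing_reasons(trace))  # ['record 1: tool-choice lacks tool_reason']
```

<p>A check like this catches a missing rationale at write time, long before anyone has to reconstruct a failure from the result.</p>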

<p>For example, with trace preserved, a decision like "this change touches both <code class="language-plaintext highlighter-rouge">payments/</code> and <code class="language-plaintext highlighter-rouge">docs/payments.md</code>, so the payments owner should handle it" can be revisited later. And if there is a record that the sub-agent was used not for parallel implementation but because a separate domain context was needed, the legitimacy of the multi-agent use can also be evaluated. Trace of this kind is closer to harness-improvement data than to a debugging memo.</p>

<h2 id="trace와-eval은-어떻게-연결되는가">How Trace and Eval Connect</h2>

<p>Trace is the raw material for eval. When a wrong-owner routing eval fails, trace makes it faster to tell whether the owner-selection rationale was wrong, the ownership mapping was stale, or the routing rule was simply missing. Tool-choice eval works the same way: from results alone it is hard to distinguish a run that succeeded by luck from one that succeeded for good reasons, but with trace that difference can be evaluated.
Documented fact: OpenAI officially documents trace grading as an eval technique (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>).</p>

<p>Suppose a task succeeded, but the trace shows an unnecessary sub-agent call and two handoff leaks that delayed the documentation update. The product may be fine, but the orchestration was not. Trace is what surfaces exactly this kind of gap outside the result.</p>
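<p>That gap between product result and orchestration quality can be scored separately. This sketch assumes hypothetical event records with <code class="language-plaintext highlighter-rouge">type</code>, <code class="language-plaintext highlighter-rouge">unnecessary</code>, and <code class="language-plaintext highlighter-rouge">leaked</code> fields; none of these names come from the official docs:</p>

```python
def orchestration_issues(trace: list[dict]) -> list[str]:
    """Collect orchestration problems even when the final result passed."""
    issues = []
    for event in trace:
        if event.get("type") == "subagent" and event.get("unnecessary"):
            issues.append("sub-agent call was unnecessary")
        if event.get("type") == "handoff" and event.get("leaked"):
            issues.append(f"handoff leaked: {event.get('what', 'unknown')}")
    return issues

# A run whose result passed, but whose trace shows weak orchestration.
trace = [
    {"type": "subagent", "unnecessary": True},
    {"type": "handoff", "leaked": True, "what": "doc-update context"},
    {"type": "handoff", "leaked": True, "what": "owner mapping"},
    {"type": "result", "passed": True},
]
print(len(orchestration_issues(trace)))  # 3
```

<p>An output-only eval would score this run as a clean pass; the trace-level check is what records that the orchestration needs work.</p>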

<h2 id="내가-직접-느낀-trace의-부족함">The Lack of Trace I Felt Directly</h2>

<p>For a while I, too, thought that execution results and logs were enough. If I knew what ran and what came out at the end, I assumed I could mostly reconstruct the run. But it later became clear that what mattered more was whether the reasoning behind the decisions had been preserved. Without that trail, improving the harness tended to stay stuck at inferring from results.</p>

<p>What this experience showed was not merely that trace is convenient. It was closer to the point that only with trace can failures be fixed structurally.</p>

<h2 id="마무리">Wrap-up</h2>

<p>Storing artifacts and execution logs matters. But that alone cannot fully explain an agent system. Harness engineering includes the observability of the decision process as well as the result, and that is exactly where trace becomes a core asset for debugging, evaluation, and improvement.</p>

<p>The next post carries this thread into approval and guardrails. If trace lets us see the decision process, the next step is that some decisions need to be restricted outright or lifted into an approval flow.</p>

<h2 id="이-글의-한-줄-요약">One-Sentence Summary</h2>

<p>Results and execution logs alone make it hard to improve a harness; only with trace that shows why a decision was made can failure causes be fixed structurally.</p>

<h2 id="다음-글-예고">Preview of the Next Post</h2>

<p>The next post will take up the claim that an agent system is unstable when approval and guardrails are missing. For some decisions, leaving a trace is not enough; certain paths must be blocked in the first place or escalated to an approval step. For risky tool use, edits outside the intended boundary, and work that needs broad permissions, up-front control matters more than post-hoc analysis. So the next post will lay out why approval and guardrails determine harness stability. If trace creates observability, guardrails define the allowed range on top of it.</p>

<h2 id="출처-및-참고">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/subagents">Subagents</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="codex" /><category term="claude-code" /><category term="harness-engineering" /><category term="trace" /><summary type="html"><![CDATA[Explains why results alone are not enough and why trace is needed, from the perspectives of operations, debugging, and evaluation.]]></summary></entry></feed>