<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.k4nul.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.k4nul.com/" rel="alternate" type="text/html" /><updated>2026-04-19T04:23:01+09:00</updated><id>https://www.k4nul.com/feed.xml</id><title type="html">K4NUL</title><subtitle>K4NUL&apos;s Blog.</subtitle><author><name>K4nul</name></author><entry xml:lang="en"><title type="html">Harness Engineering 10. From Documents to an Observable Harness</title><link href="https://www.k4nul.com/en/ai/from-document-centered-ops-to-an-observable-harness/" rel="alternate" type="text/html" title="Harness Engineering 10. From Documents to an Observable Harness" /><published>2026-04-19T00:00:00+09:00</published><updated>2026-04-19T00:00:00+09:00</updated><id>https://www.k4nul.com/en/ai/from-document-centered-ops-to-observable-harness-en</id><content type="html" xml:base="https://www.k4nul.com/en/ai/from-document-centered-ops-to-an-observable-harness/"><![CDATA[<h2 id="summary">Summary</h2>

<p>Writing longer documents is not the solution. That is where the entire series eventually lands. Files like <code class="language-plaintext highlighter-rouge">AGENTS.md</code> or <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> can be useful starting points, but making them denser does not magically create a harness. In this final post, I want to gather the discussion into a practical roadmap for moving from document-centered operations to a harness that is observable and verifiable.</p>

<h2 id="verification-scope-and-interpretation-boundary">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026. I checked the OpenAI Codex docs, the OpenAI platform docs, and the Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored engineering posts only when a concept is introduced there.</li>
  <li>Fact boundary: only documented features such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, evals, and trace are treated as factual claims.</li>
  <li>Interpretation boundary: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating abstractions used in this series unless a source line says otherwise.</li>
</ul>

<h2 id="if-you-compress-the-problem-into-one-sentence">If You Compress the Problem into One Sentence</h2>

<p>The real issue was never that there were too few documents. It was that too much of the harness still lived only inside documents. When good principles remain only as prose, they are easy to miss. Handoffs drift, eval looks only at artifacts, and trace and guardrails remain too thin to stabilize the system.
Interpretation: this section is my own one-sentence summary of the series. The factual anchor is that the official docs already separate instruction files, hooks, eval, trace, and approvals into distinct operating layers (<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>).</p>

<h2 id="core-principles-of-the-transition">Core Principles of the Transition</h2>

<p>The direction of change is actually straightforward. Responsibilities that were overloaded into documentation have to be redistributed across multiple layers.
Documented fact: the official docs already expose instruction, automation/hooks, evaluation, trace, and approval or permission as separate layers (<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>).
Interpretation: the transition principles in this post are my recommendation for turning those layers into an operating design.</p>

<ol>
  <li>Keep the project instruction file light and treat it as an entrypoint.</li>
  <li>Split detailed operating documents into separate locations.</li>
  <li>Structure handoff as schema.</li>
  <li>Add agent workflow eval alongside product tests.</li>
  <li>Lift repeated document rules into CI, hooks, or validation.</li>
  <li>Preserve trace, not just outputs.</li>
  <li>Define approval boundaries and guardrails.</li>
</ol>

<p>If a team keeps solving problems by adding more prose, the result is more sentences like “update the docs,” “leave a handoff,” or “respect ownership.” If the team instead separates roles and adds automatic checks, the instruction file gets shorter while outer layers emerge: docs freshness checks, handoff schema validation, changed-path ownership checks, and trace review. The difference is not document length. It is where the harness actually lives.</p>
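<p>To make handoff-as-schema concrete, the check can start very small. The sketch below is my own illustration, not from any official doc: it assumes handoff notes written as simple <code class="language-plaintext highlighter-rouge">key: value</code> lines, and the field names <code class="language-plaintext highlighter-rouge">owner</code>, <code class="language-plaintext highlighter-rouge">change</code>, and <code class="language-plaintext highlighter-rouge">verification</code> are placeholder choices. It reports required handoff fields that are missing.</p>

```rust
// Minimal sketch of a handoff schema check. It treats a handoff note as
// "key: value" lines and reports required fields that are absent.
// The field names ("owner", "change", "verification") are illustrative.
fn missing_fields<'a>(handoff: &str, required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .filter(|field| {
            let prefix = format!("{}:", field);
            // A field counts as present if any line starts with "field:".
            !handoff
                .lines()
                .any(|line| line.trim_start().starts_with(&prefix))
        })
        .copied()
        .collect()
}

fn main() {
    let handoff = "owner: team-a\nchange: refactor the parser\n";
    let missing = missing_fields(handoff, &["owner", "change", "verification"]);
    assert_eq!(missing, vec!["verification"]);
    println!("missing handoff fields: {:?}", missing);
}
```

<p>A check like this can run as a hook or CI step before a handoff is accepted, which is exactly the move from prose rule to validation described above.</p>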

<p>Taken together, the official documentation and case studies point to a broadly similar transition order across tools. Whether the example is Codex, Claude Code, or another agentic coding environment, the path still moves toward a lighter entrypoint and stronger outer layers for schema, enforcement, eval, trace, and guardrails.</p>

<h2 id="a-realistic-order-of-adoption">A Realistic Order of Adoption</h2>

<p>In practice, this is easier to do as a gradual reconstruction than as one massive rewrite.
Interpretation: the adoption order here is my staged rollout recommendation. The official docs do not prescribe this exact sequence.</p>

<ol>
  <li>Lighten the project instruction file first. Leave only core expectations and reference paths.</li>
  <li>Split detailed operating procedures into separate documents so explanation and system rules stop competing in the same place.</li>
  <li>Introduce a handoff schema, even if it starts with only a few required fields.</li>
  <li>Add at least one agent workflow eval such as wrong-owner routing, handoff completeness, or stale-doc detection.</li>
  <li>Move repeatedly missed rules into CI or hook-based enforcement.</li>
  <li>Define approval boundaries and guardrails as part of the execution model rather than as an afterthought.</li>
</ol>
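<p>For step 5, enforcement can begin as a one-file check. The sketch below is my own illustration, assuming a 100-line budget and an <code class="language-plaintext highlighter-rouge">AGENTS.md</code> path that I chose for the example; nothing in the official docs prescribes either. It fails a CI run when the instruction file outgrows its entrypoint role.</p>

```rust
use std::fs;

// Sketch: turn the prose rule "keep the instruction file light" into a
// check that CI can run. The 100-line budget and the AGENTS.md path are
// illustrative assumptions, not from any official doc.
fn over_budget(contents: &str, budget: usize) -> bool {
    contents.lines().count() > budget
}

fn main() {
    let budget = 100;
    // A missing file is treated as empty, i.e. trivially within budget.
    let contents = fs::read_to_string("AGENTS.md").unwrap_or_default();
    if over_budget(&contents, budget) {
        eprintln!("AGENTS.md exceeds the {}-line budget", budget);
        std::process::exit(1);
    }
    println!("AGENTS.md is within the {}-line budget", budget);
}
```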

<h2 id="how-small-projects-can-start">How Small Projects Can Start</h2>

<p>Small projects can begin even more simply. A practical four-week transition plan might look like this.
Interpretation: the four-week plan is my proposal. The factual basis is simply that instruction, hooks, eval, trace, and approvals already exist as separate documented layers (<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>).</p>

<ol>
  <li>Week 1: cut the instruction file in half and move detailed procedures into separate docs.</li>
  <li>Week 2: introduce one handoff schema and one docs freshness check.</li>
  <li>Week 3: add one agent workflow eval such as wrong-owner routing or stale-doc detection.</li>
  <li>Week 4: define a minimal set of trace fields and approval boundaries.</li>
</ol>

<p>The point is not to build a perfect system immediately. Smaller projects are often more tempted to pile everything into one control-plane document. In reality, they benefit even more from starting with a few small schemas, checks, and trace fields.</p>
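<p>As one illustration of what &ldquo;a few trace fields&rdquo; might mean in week 4, a minimal per-task trace record can be sketched like this. Every field name here is my own assumption, not a documented format; the point is only that a handful of structured fields already makes a run reviewable after the fact.</p>

```rust
// Sketch of a minimal per-task trace record. All field names are
// illustrative; real deployments would pick fields that match their
// own review questions.
#[derive(Debug)]
struct TraceRecord {
    task_id: String,
    tool_calls: u32,
    files_touched: Vec<String>,
    approval_requested: bool,
}

fn main() {
    let record = TraceRecord {
        task_id: "task-042".to_string(),
        tool_calls: 7,
        files_touched: vec!["src/parser.rs".to_string()],
        approval_requested: false,
    };
    // Even a Debug dump is enough to start reviewing traces.
    println!("{:?}", record);
}
```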

<h2 id="how-i-used-to-see-the-problem-and-how-i-see-it-now">How I Used to See the Problem, and How I See It Now</h2>

<p>At first, I also tried to solve the problem by writing a better <code class="language-plaintext highlighter-rouge">AGENTS.md</code>. It seemed like better rules, more detailed role definitions, and more careful handoff and documentation guidance should make the system more stable. Over time, though, it became much clearer that the core issue was not adding more documentation, but redesigning the operating structure itself.</p>

<p>That shift is less about reflection and more about design. Writing documents still matters, but a harness only becomes a system once schema, enforcement, eval, trace, and guardrails enter the picture.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>The main argument of the series is simple. Agent-system quality does not become stable because of a single prompt or a longer operating document. Instruction files should become lightweight entrypoints, while detailed operating behavior becomes testable and observable through schema, eval, enforcement, trace, and guardrails. The shift from document-centered operations to an observable harness is less a massive one-time redesign than a gradual reconstruction that keeps moving repeated failure points out of prose and into outer layers.</p>

<p>If I had to leave one final sentence, it would be this: do not start by writing more documentation; start by deciding what needs to become visible and checkable outside the document.</p>

<h2 id="one-sentence-summary-of-the-series">One-Sentence Summary of the Series</h2>

<p>Harness engineering is not about writing more guidance; it is about moving operating principles out of prose and into schema, eval, enforcement, trace, and guardrails so the system becomes observable and verifiable.</p>

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>The best place to start is not by extending the instruction file, but by checking whether that file is carrying too many responsibilities already. After that, pick one rule that is repeatedly missed and turn it into a schema or a check. Even one structured handoff, one docs freshness check, or a few trace fields can noticeably change how stable the system feels. In a small project, it is usually better to move one shaky area into an outer layer than to attempt a total redesign. In that sense, the end of this series is closer to a starting line than a conclusion. What matters now is not more documentation, but an operating structure that is easier to observe and easier to verify.</p>

<h2 id="sources-and-references">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/guides/agents-md">Custom instructions with AGENTS.md</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/hooks">Hooks</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="roadmap" /><category term="eval" /><category term="trace" /><category term="guardrail" /><summary type="html"><![CDATA[A practical roadmap for moving from document-centered operations to an observable harness.]]></summary></entry><entry xml:lang="ko"><title type="html">하네스 엔지니어링 10. 문서 중심 운영에서 관측 가능한 하네스로: 실전 전환 로드맵</title><link href="https://www.k4nul.com/ai/from-document-centered-ops-to-observable-harness/" rel="alternate" type="text/html" title="하네스 엔지니어링 10. 문서 중심 운영에서 관측 가능한 하네스로: 실전 전환 로드맵" /><published>2026-04-19T00:00:00+09:00</published><updated>2026-04-19T00:00:00+09:00</updated><id>https://www.k4nul.com/ai/from-document-centered-ops-to-observable-harness</id><content type="html" xml:base="https://www.k4nul.com/ai/from-document-centered-ops-to-observable-harness/"><![CDATA[<h2 id="요약">요약</h2>

<p>문서를 더 길게 쓰는 것만으로는 해결되지 않는다. 시리즈 전체를 관통한 문제의식도 결국 여기로 모인다. AGENTS.md나 CLAUDE.md 같은 지침 파일을 더 촘촘히 쓰는 일은 출발점일 수는 있지만, 그 자체가 하네스를 만들어 주지는 않는다. 이번 최종편에서는 지금까지의 논의를 묶어, 문서 중심 운영에서 관측 가능하고 검증 가능한 하네스로 실제로 어떻게 전환할 수 있는지 정리해보려 한다.</p>

<h2 id="검증-기준과-해석-경계">검증 기준과 해석 경계</h2>

<ul>
  <li>시점: 2026-04-15 기준 OpenAI Codex 문서, OpenAI 공식 API 문서, Anthropic Claude Code 문서를 확인했다.</li>
  <li>출처 등급: 공식 문서 우선, 개념을 처음 소개한 vendor-authored 글만 보조로 사용했다.</li>
  <li>사실 범위: <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, eval, trace처럼 문서로 확인 가능한 기능만 사실로 적었다.</li>
  <li>해석 범위: <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, <code class="language-plaintext highlighter-rouge">observable harness</code> 같은 표현은 이 시리즈에서 쓰는 운영 개념이며, 별도 근거 줄이 없는 경우 필자의 해석이다.</li>
</ul>

<h2 id="지금까지의-문제를-한-문장으로-다시-정리하면">지금까지의 문제를 한 문장으로 다시 정리하면</h2>

<p>문제가 된 것은 문서가 적어서가 아니라, 하네스가 문서에만 머물러 있었다는 점이다. 좋은 원칙이 자연어로만 남아 있으면 놓치기 쉽고, handoff는 흩어지고, eval은 결과물만 보게 되고, trace와 guardrail이 비면 시스템은 겉보기보다 훨씬 불안정해진다.
해석: 이 절은 시리즈 전체를 한 문장으로 요약하는 필자의 관점이다. 다만 공식 문서들을 보면 instruction file, hooks, eval, trace, approvals가 실제로 분리된 운영 계층으로 존재한다(<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)</p>

<h2 id="전환의-핵심-원칙">전환의 핵심 원칙</h2>

<p>전환의 방향은 복잡하지 않다. 문서에 몰려 있던 역할을 여러 층으로 다시 나누면 된다.
문서로 확인 가능한 사실: 공식 문서에는 이미 instruction, automation/hooks, evaluation, trace, approval/permission이 분리된 계층으로 등장한다(<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)
해석: 이 글의 전환 원칙은 그 계층들을 운영 구조로 재배치하는 필자의 권고다.</p>

<ol>
  <li>프로젝트 지침 파일은 entrypoint로 경량화한다.</li>
  <li>세부 운영 문서는 별도 경로로 분리한다.</li>
  <li>handoff는 schema로 구조화한다.</li>
  <li>제품 테스트와 별도로 agent workflow eval을 둔다.</li>
  <li>문서 규칙은 CI, hook, validation 같은 enforcement로 올린다.</li>
  <li>결과뿐 아니라 trace를 남긴다.</li>
  <li>approval boundary와 guardrail을 함께 설계한다.</li>
</ol>

<p>예를 들어 문제를 문서만 계속 추가하는 방식으로 풀려 하면, “문서를 갱신해라”, “handoff를 남겨라”, “owner를 지켜라” 같은 문장이 늘어난다. 반대로 역할을 분리하고 자동 검증을 붙이는 방식으로 가면 지침 파일은 짧아지고, docs freshness check, handoff schema validation, changed-path ownership check, trace review 같은 바깥 층이 생긴다. 차이는 문서의 분량이 아니라, 하네스가 어디에 존재하느냐다.</p>

<p>공식 문서와 사례를 종합해보면, 도구는 달라도 전환 순서는 크게 비슷하다. Codex든 Claude Code든 결국 entrypoint를 가볍게 하고, schema, enforcement, eval, trace, guardrail을 바깥 레이어로 세우는 방향으로 가게 된다.</p>

<h2 id="현실적인-적용-순서">현실적인 적용 순서</h2>

<p>현실에서는 한 번에 다 바꾸기보다 점진적으로 재구성하는 편이 낫다.
해석: 어떤 순서로 옮길지는 필자의 단계적 도입안이다. 공식 문서가 특정 순서를 강제하는 것은 아니다.</p>

<ol>
  <li>프로젝트 지침 파일을 먼저 경량화한다. 저장소의 핵심 기대치와 참조 경로만 남긴다.</li>
  <li>운영 절차와 기준은 별도 문서로 분리한다. 사람을 위한 설명과 시스템 규칙을 섞지 않는다.</li>
  <li>handoff schema를 도입한다. 최소 필드부터 고정해 누락을 줄인다.</li>
  <li>agent workflow eval을 추가한다. wrong-owner routing, handoff completeness, stale docs 같은 실패를 별도로 본다.</li>
  <li>반복적으로 놓치는 규칙을 CI나 hook으로 올린다. prose를 enforcement로 바꾸는 단계다.</li>
  <li>approval boundary와 guardrail을 명시한다. 위험한 실행 경계를 뒤늦게 붙이지 말고 구조 안에 넣는다.</li>
</ol>

<h2 id="작은-프로젝트에서는-어떻게-시작할-수-있는가">작은 프로젝트에서는 어떻게 시작할 수 있는가</h2>

<p>작은 프로젝트라면 더 단순하게 시작해도 된다. 예를 들어 4주 전환안은 아래처럼 잡을 수 있다.
해석: 이 절의 4주 계획은 필자의 제안이다. 다만 출발점으로 instruction, hooks, eval, trace, approvals 같은 분리된 계층을 확인할 수 있다는 점은 공식 문서로 뒷받침된다(<a href="https://developers.openai.com/codex/guides/agents-md">OpenAI AGENTS.md</a>, <a href="https://developers.openai.com/codex/hooks">Hooks</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>)</p>

<ol>
  <li>1주차: 지침 파일을 반으로 줄이고, 세부 문서를 별도 폴더로 뺀다.</li>
  <li>2주차: handoff schema 하나와 docs freshness check 하나를 도입한다.</li>
  <li>3주차: wrong-owner routing 또는 stale-doc detection 같은 agent workflow eval 하나를 붙인다.</li>
  <li>4주차: trace 필드와 approval boundary를 최소 단위로 정의한다.</li>
</ol>

<p>핵심은 완벽한 체계를 한 번에 만들려 하지 않는 것이다. 작은 프로젝트일수록 더 큰 control plane을 문서 하나에 몰아넣기 쉽다. 오히려 작은 단위의 schema, check, trace부터 붙이는 편이 현실적이다.</p>

<h2 id="내가-처음-문제를-보던-방식과-지금의-차이">내가 처음 문제를 보던 방식과 지금의 차이</h2>

<p>처음에는 나 역시 AGENTS.md를 더 잘 쓰는 방향으로 문제를 해결하려 한 적이 있다. 좋은 규칙을 더 적고, 역할을 더 자세히 설명하고, handoff와 문서 규칙을 더 꼼꼼히 넣으면 안정성이 올라갈 것처럼 보였기 때문이다. 하지만 점점 분명해진 것은 문서를 더 추가하는 일이 아니라 운영 구조를 재설계하는 일이 핵심이라는 점이었다.</p>

<p>이 차이는 감상보다 설계의 문제에 가깝다. 문서를 잘 쓰는 일은 여전히 필요하지만, 하네스는 schema, enforcement, eval, trace, guardrail이 들어오면서 비로소 시스템이 된다.</p>

<h2 id="마무리">마무리</h2>

<p>이 시리즈가 말하고 싶었던 것은 결국 하나다. agent 시스템의 품질은 프롬프트 한 줄이나 문서 한 장으로 안정되지 않는다. 지침 파일은 가벼운 entrypoint가 되어야 하고, 세부 운영은 schema, eval, enforcement, trace, guardrail로 분산된 구조 안에서 검증 가능해져야 한다. 문서 중심 운영에서 관측 가능한 하네스로의 전환은 거대한 일괄 개편이 아니라, 반복적으로 놓치는 지점을 하나씩 바깥 레이어로 끌어내는 점진적 재구성에 더 가깝다.</p>

<p>독자에게 마지막으로 남기고 싶은 문장은 이것이다. 문서를 더 쓰는 것보다, 문서 밖에서 무엇을 자동으로 보이게 하고 검사하게 만들 것인가부터 설계하자.</p>

<h2 id="이-시리즈의-한-줄-요약">이 시리즈의 한 줄 요약</h2>

<p>하네스 엔지니어링의 핵심은 좋은 문서를 더 많이 쓰는 것이 아니라, 문서에 머물러 있던 운영 원칙을 schema, eval, enforcement, trace, guardrail로 옮겨 관측 가능하고 검증 가능한 시스템으로 재구성하는 일이다.</p>

<h2 id="마무리하며">마무리하며</h2>

<p>가장 먼저 할 일은 지침 파일을 더 길게 쓰는 것이 아니라, 지금 그 파일이 너무 많은 역할을 떠안고 있는지 보는 것이다. 그다음에는 반복적으로 놓치는 규칙 하나를 골라 schema나 check로 바꿔보는 편이 좋다. handoff 하나를 구조화하고, 문서 갱신 하나를 CI로 올리고, trace 필드 몇 개를 남기는 것만으로도 시스템은 눈에 띄게 달라질 수 있다. 작은 프로젝트라면 더더욱 한 번에 다 하려 하지 말고, 가장 자주 흔들리는 지점부터 바깥 레이어로 옮기면 된다. 이 시리즈가 끝나는 지점은 결론이라기보다 출발선에 가깝다. 이제 필요한 것은 더 많은 문서가 아니라, 더 잘 보이고 더 잘 검증되는 운영 구조다.</p>

<h2 id="출처-및-참고">출처 및 참고</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/guides/agents-md">Custom instructions with AGENTS.md</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/hooks">Hooks</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="roadmap" /><category term="eval" /><category term="trace" /><category term="guardrail" /><summary type="html"><![CDATA[문서 중심 운영에서 관측 가능한 하네스로 옮겨 가는 실전 전환 로드맵을 정리한 글.]]></summary></entry><entry xml:lang="en"><title type="html">Rust 11. File I/O and Command-Line Input</title><link href="https://www.k4nul.com/en/rust/file-io-and-command-line-input/" rel="alternate" type="text/html" title="Rust 11. File I/O and Command-Line Input" /><published>2026-04-18T09:00:00+09:00</published><updated>2026-04-18T09:00:00+09:00</updated><id>https://www.k4nul.com/en/rust/rust-file-io-and-cli-input-en</id><content type="html" xml:base="https://www.k4nul.com/en/rust/file-io-and-command-line-input/"><![CDATA[<h2 id="summary">Summary</h2>

<p>Once you move from Rust syntax exercises into real tools, you eventually need to read files, accept arguments, and handle errors. Understanding <code class="language-plaintext highlighter-rouge">std::env::args</code>, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code>, <code class="language-plaintext highlighter-rouge">std::fs::write</code>, and <code class="language-plaintext highlighter-rouge">Result</code> as one flow makes small CLI programs much easier to build.</p>

<p>This post focuses on the simplest pattern: accept a file path from the command line, read the file, process the text, and print or save a summary. The practical beginner approach is to read arguments in <code class="language-plaintext highlighter-rouge">main</code>, use the standard library for file access, keep the core calculation in a separate function, and propagate failures upward with <code class="language-plaintext highlighter-rouge">Result</code> and <code class="language-plaintext highlighter-rouge">?</code>.</p>

<h2 id="document-information">Document Information</h2>

<ul>
  <li>Written on: 2026-04-15</li>
  <li>Verification date: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Test version: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: only official documentation is used.</li>
  <li>Note: this post covers only the basic pattern for small UTF-8 text CLI tools and leaves out large-file streaming and advanced argument parsers.</li>
</ul>

<h2 id="problem-definition">Problem Definition</h2>

<p>It is hard to feel that you have really learned Rust if every example ends at syntax. Real beginner tools usually need a flow like this:</p>

<ul>
  <li>accept a path or option from the command line</li>
  <li>read a file</li>
  <li>process the contents</li>
  <li>return the result to the terminal or another file</li>
</ul>

<p>This post stays with the most basic text-file workflow. Large-file streaming, binary data, and advanced argument parsing libraries are intentionally left out.</p>

<h2 id="verified-facts">Verified Facts</h2>

<ul>
  <li>According to the standard library docs, <code class="language-plaintext highlighter-rouge">std::env::args</code> returns an iterator over the command-line arguments of the current process, and the first item is typically the program path.
Evidence: <a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li>According to the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code> reads a whole file into a <code class="language-plaintext highlighter-rouge">String</code>, and non-UTF-8 data can cause an error.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li>According to the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::write</code> writes a byte sequence to a file, creating it if needed and replacing the full contents if it already exists.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li>According to the official Rust Book, <code class="language-plaintext highlighter-rouge">Result</code> and the <code class="language-plaintext highlighter-rouge">?</code> operator form the basic pattern for propagating recoverable errors to the caller.
Evidence: <a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
</ul>

<p>One of the smallest useful examples is a CLI that counts lines and words in a text file.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::{</span><span class="n">env</span><span class="p">,</span> <span class="nn">error</span><span class="p">::</span><span class="n">Error</span><span class="p">,</span> <span class="n">fs</span><span class="p">,</span> <span class="n">io</span><span class="p">};</span>

<span class="k">fn</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">usize</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">lines</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.lines</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">words</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.split_whitespace</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">Box</span><span class="o">&lt;</span><span class="k">dyn</span> <span class="n">Error</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">path</span> <span class="o">=</span> <span class="nn">env</span><span class="p">::</span><span class="nf">args</span><span class="p">()</span><span class="nf">.nth</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="nf">.ok_or_else</span><span class="p">(||</span> <span class="p">{</span>
        <span class="nn">io</span><span class="p">::</span><span class="nn">Error</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">io</span><span class="p">::</span><span class="nn">ErrorKind</span><span class="p">::</span><span class="n">InvalidInput</span><span class="p">,</span> <span class="s">"usage: cargo run -- &lt;file-path&gt;"</span><span class="p">)</span>
    <span class="p">})</span><span class="o">?</span><span class="p">;</span>

    <span class="k">let</span> <span class="n">contents</span> <span class="o">=</span> <span class="nn">fs</span><span class="p">::</span><span class="nf">read_to_string</span><span class="p">(</span><span class="o">&amp;</span><span class="n">path</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
    <span class="k">let</span> <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span> <span class="o">=</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="o">&amp;</span><span class="n">contents</span><span class="p">);</span>

    <span class="k">let</span> <span class="n">summary</span> <span class="o">=</span> <span class="nd">format!</span><span class="p">(</span><span class="s">"file = {}</span><span class="se">\n</span><span class="s">lines = {}</span><span class="se">\n</span><span class="s">words = {}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">);</span>

    <span class="nd">println!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="n">summary</span><span class="p">);</span>
    <span class="nn">fs</span><span class="p">::</span><span class="nf">write</span><span class="p">(</span><span class="s">"summary.txt"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">summary</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

    <span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key reading order for that code is:</p>

<ol>
  <li>Use <code class="language-plaintext highlighter-rouge">env::args().nth(1)</code> to get the first real argument.</li>
  <li>If it does not exist, create an <code class="language-plaintext highlighter-rouge">io::Error</code> and return it through <code class="language-plaintext highlighter-rouge">Result</code>.</li>
  <li>Read the whole file with <code class="language-plaintext highlighter-rouge">fs::read_to_string</code>.</li>
  <li>Keep the core calculation in a small function like <code class="language-plaintext highlighter-rouge">count_lines_and_words</code>.</li>
  <li>Let <code class="language-plaintext highlighter-rouge">main</code> handle output and saving.</li>
</ol>

<p>That structure matters because it connects naturally to testing. Once the actual text-processing logic is separated from file I/O, it can be tested without creating files at all.</p>
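<p>A minimal sketch of that idea, reusing <code class="language-plaintext highlighter-rouge">count_lines_and_words</code> from the example above; the test strings are my own choices:</p>

```rust
// count_lines_and_words is repeated here so the sketch is self-contained;
// it is the same pure function as in the example above.
fn count_lines_and_words(text: &str) -> (usize, usize) {
    (text.lines().count(), text.split_whitespace().count())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn counts_a_small_file_worth_of_text() {
        let text = "Rust makes tools practical.\nRust makes testing easier.\n";
        assert_eq!(count_lines_and_words(text), (2, 8));
    }

    #[test]
    fn empty_input_counts_as_zero() {
        assert_eq!(count_lines_and_words(""), (0, 0));
    }
}

fn main() {
    // Quick check outside the test harness; `cargo test` runs the module above.
    assert_eq!(count_lines_and_words("a b\nc\n"), (2, 3));
}
```

<p>No temporary files, no cleanup: because the function takes <code class="language-plaintext highlighter-rouge">&amp;str</code> rather than a path, the tests stay fast and deterministic.</p>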

<h2 id="directly-confirmed-results">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions available in the current writing environment were:</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: when I placed the following <code class="language-plaintext highlighter-rouge">sample.txt</code> next to the example and ran the program, the result was:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rust makes tools practical.
Rust makes testing easier.
</code></pre></div></div>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">run</span><span class="w"> </span><span class="nt">--quiet</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="nx">sample.txt</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: the generated <code class="language-plaintext highlighter-rouge">summary.txt</code> contained:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Limitation of direct reproduction: I reproduced the flow with a sample file in a temporary Cargo project, but I did not validate larger files, non-UTF-8 input, or additional error paths.</li>
</ul>

<h2 id="interpretation--opinion">Interpretation / Opinion</h2>

<ul>
  <li>My view is that the first important CLI skill is not a powerful parser library, but clean input/output boundaries.</li>
  <li>Opinion: keep file reading, argument handling, and user-facing errors in <code class="language-plaintext highlighter-rouge">main</code>, while moving text processing into separate functions.</li>
  <li>Opinion: for small UTF-8 text files, <code class="language-plaintext highlighter-rouge">read_to_string</code> is the simplest place to start. Going too early into buffering and streaming can hide the larger structure beginners need to learn first.</li>
</ul>

<h2 id="limits-and-exceptions">Limits and Exceptions</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">std::env::args</code> has Unicode-related limitations, so <code class="language-plaintext highlighter-rouge">args_os</code> may be more appropriate when non-Unicode input matters.</li>
  <li><code class="language-plaintext highlighter-rouge">read_to_string</code> loads the whole file into memory, which may not fit very large files or binary data.</li>
  <li><code class="language-plaintext highlighter-rouge">write</code> replaces the existing contents, so append-style output needs a different API.</li>
  <li>Real CLI tools may need crates like <code class="language-plaintext highlighter-rouge">clap</code>, the standard-library <code class="language-plaintext highlighter-rouge">BufRead</code> trait, and more specific error types, but this post intentionally stays with the simplest standard-library pattern.</li>
</ul>
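<p>For the append case mentioned above, a minimal standard-library sketch looks like the following. The file name <code class="language-plaintext highlighter-rouge">log.txt</code> and the helper name are illustrative, not from the official docs.</p>

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Append one line to a file instead of replacing its contents like fs::write.
fn append_line(path: &str, line: &str) -> std::io::Result<()> {
    // `append(true)` keeps existing contents; `create(true)` makes the file if missing.
    let mut file = OpenOptions::new().append(true).create(true).open(path)?;
    writeln!(file, "{line}")
}

fn main() -> std::io::Result<()> {
    append_line("log.txt", "first run")?;
    append_line("log.txt", "second run")?;
    // After two calls, log.txt holds both lines instead of only the last one.
    println!("{}", std::fs::read_to_string("log.txt")?);
    Ok(())
}
```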

<h2 id="references">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html">Accepting Command Line Arguments</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="file-io" /><category term="cli" /><category term="std-fs" /><category term="env-args" /><summary type="html"><![CDATA[Rust guide to file and CLI input with std::fs::read_to_string, write, std::env::args, and Result.]]></summary></entry><entry xml:lang="ko"><title type="html">Rust 11. File I/O and Command-Line Input</title><link href="https://www.k4nul.com/rust/rust-file-io-and-cli-input/" rel="alternate" type="text/html" title="Rust 11. File I/O and Command-Line Input" /><published>2026-04-18T09:00:00+09:00</published><updated>2026-04-18T09:00:00+09:00</updated><id>https://www.k4nul.com/rust/rust-file-io-and-cli-input</id><content type="html" xml:base="https://www.k4nul.com/rust/rust-file-io-and-cli-input/"><![CDATA[<h2 id="요약">Summary</h2>

<p>Once you have learned basic Rust syntax and collections and start building real tools, you eventually have to read files, accept arguments, and handle errors. At that point, understanding <code class="language-plaintext highlighter-rouge">std::env::args</code>, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code>, <code class="language-plaintext highlighter-rouge">std::fs::write</code>, and <code class="language-plaintext highlighter-rouge">Result</code> as a single flow dramatically lowers the barrier to writing small CLI programs.</p>

<p>This post covers the most basic pattern: read a text file, take a command-line argument, and print a simple summary. The conclusion up front: at the beginner stage, the simplest and most maintainable flow is to take arguments in <code class="language-plaintext highlighter-rouge">main</code>, read files with the standard library, move the core calculation into a separate function, and propagate failures upward with <code class="language-plaintext highlighter-rouge">Result</code> and <code class="language-plaintext highlighter-rouge">?</code>.</p>

<h2 id="문서-정보">Document Information</h2>

<ul>
  <li>Written: 2026-04-15</li>
  <li>Verified as of: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Tested versions: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: official documentation only.</li>
  <li>Note: this post covers only the basic pattern for small UTF-8 text CLIs; large-file streaming and advanced argument parsers are out of scope.</li>
</ul>

<h2 id="문제-정의">Problem Definition</h2>

<p>Syntax explanations alone rarely make Rust feel learned. In practice, the sense of having built one tool end to end only arrives once the following flow is in place.</p>

<ul>
  <li>Take a file path or options from the command line.</li>
  <li>Read the file.</li>
  <li>Process the contents.</li>
  <li>Return the result to the screen or to another file.</li>
</ul>

<p>This post focuses on the most basic text-file-processing pattern within that flow. Streaming large files, binary files, and sophisticated argument-parsing libraries are out of scope.</p>

<h2 id="확인된-사실">Confirmed Facts</h2>

<ul>
  <li>Per the standard library docs, <code class="language-plaintext highlighter-rouge">std::env::args</code> returns the command-line arguments of the current process as an iterator, and the first value is usually the program path.
Evidence: <a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li>Per the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::read_to_string</code> reads an entire file into a <code class="language-plaintext highlighter-rouge">String</code>, and non-UTF-8 data can produce an error.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li>Per the standard library docs, <code class="language-plaintext highlighter-rouge">std::fs::write</code> writes a byte sequence to a file, creating it if it does not exist and replacing the entire contents if it does.
Evidence: <a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li>Per the official book, <code class="language-plaintext highlighter-rouge">Result</code> and the <code class="language-plaintext highlighter-rouge">?</code> operator are the basic pattern for passing errors up to the caller.
Evidence: <a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
</ul>

<p>The smallest example is a CLI that counts lines and words in a file, as shown below.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::{</span><span class="n">env</span><span class="p">,</span> <span class="nn">error</span><span class="p">::</span><span class="n">Error</span><span class="p">,</span> <span class="n">fs</span><span class="p">,</span> <span class="n">io</span><span class="p">};</span>

<span class="k">fn</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">usize</span><span class="p">,</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">lines</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.lines</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">words</span> <span class="o">=</span> <span class="n">text</span><span class="nf">.split_whitespace</span><span class="p">()</span><span class="nf">.count</span><span class="p">();</span>
    <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">Box</span><span class="o">&lt;</span><span class="k">dyn</span> <span class="n">Error</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">path</span> <span class="o">=</span> <span class="nn">env</span><span class="p">::</span><span class="nf">args</span><span class="p">()</span><span class="nf">.nth</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="nf">.ok_or_else</span><span class="p">(||</span> <span class="p">{</span>
        <span class="nn">io</span><span class="p">::</span><span class="nn">Error</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">io</span><span class="p">::</span><span class="nn">ErrorKind</span><span class="p">::</span><span class="n">InvalidInput</span><span class="p">,</span> <span class="s">"usage: cargo run -- &lt;file-path&gt;"</span><span class="p">)</span>
    <span class="p">})</span><span class="o">?</span><span class="p">;</span>

    <span class="k">let</span> <span class="n">contents</span> <span class="o">=</span> <span class="nn">fs</span><span class="p">::</span><span class="nf">read_to_string</span><span class="p">(</span><span class="o">&amp;</span><span class="n">path</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
    <span class="k">let</span> <span class="p">(</span><span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">)</span> <span class="o">=</span> <span class="nf">count_lines_and_words</span><span class="p">(</span><span class="o">&amp;</span><span class="n">contents</span><span class="p">);</span>

    <span class="k">let</span> <span class="n">summary</span> <span class="o">=</span> <span class="nd">format!</span><span class="p">(</span><span class="s">"file = {}</span><span class="se">\n</span><span class="s">lines = {}</span><span class="se">\n</span><span class="s">words = {}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">lines</span><span class="p">,</span> <span class="n">words</span><span class="p">);</span>

    <span class="nd">println!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="n">summary</span><span class="p">);</span>
    <span class="nn">fs</span><span class="p">::</span><span class="nf">write</span><span class="p">(</span><span class="s">"summary.txt"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">summary</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

    <span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key order for reading this code is as follows.</p>

<ol>
  <li>Pull the first real argument with <code class="language-plaintext highlighter-rouge">env::args().nth(1)</code>.</li>
  <li>If the argument is missing, build an <code class="language-plaintext highlighter-rouge">io::Error</code> and return it through <code class="language-plaintext highlighter-rouge">Result</code>.</li>
  <li>Read the whole file with <code class="language-plaintext highlighter-rouge">fs::read_to_string</code>.</li>
  <li>Keep the core calculation in a small function like <code class="language-plaintext highlighter-rouge">count_lines_and_words</code>.</li>
  <li>Let <code class="language-plaintext highlighter-rouge">main</code> handle output and saving.</li>
</ol>

<p>That structure matters because it connects naturally to <code class="language-plaintext highlighter-rouge">testing</code> later. Once the actual calculation logic lives in a pure function, it can be tested without any file I/O.</p>
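<p>As an illustration, and using the same function shape as the example above (the layout is my own sketch, not from the post's project), even edge cases can be checked without touching the filesystem:</p>

```rust
// Same shape as the example's helper: pure text in, counts out.
fn count_lines_and_words(text: &str) -> (usize, usize) {
    (text.lines().count(), text.split_whitespace().count())
}

fn main() {
    // Edge cases that would be awkward to set up as real files:
    assert_eq!(count_lines_and_words(""), (0, 0)); // empty input
    assert_eq!(count_lines_and_words("  \n  \n"), (2, 0)); // whitespace-only lines
    assert_eq!(count_lines_and_words("one two\nthree"), (2, 3)); // no trailing newline
    println!("all edge cases pass");
}
```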

<h2 id="직접-확인한-결과">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions in the current writing environment were as follows.</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: with the following <code class="language-plaintext highlighter-rouge">sample.txt</code> in place, running the example from the body produced the result below.</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rust makes tools practical.
Rust makes testing easier.
</code></pre></div></div>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">run</span><span class="w"> </span><span class="nt">--quiet</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="nx">sample.txt</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: the <code class="language-plaintext highlighter-rouge">summary.txt</code> generated by the same run contained the following.</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file = sample.txt
lines = 2
words = 8
</code></pre></div></div>

<ul>
  <li>Limits of direct verification: I reproduced the example flow with a representative input file in a temporary Cargo project, but I did not verify large files, non-UTF-8 input, or additional error cases.</li>
</ul>

<h2 id="해석--의견">Interpretation / Opinion</h2>

<ul>
  <li>My view is that the first thing to learn in a beginner CLI is not an advanced argument parser but a clean separation of the input/output boundary.</li>
  <li>Opinion: keep file reading, argument reading, and error output in <code class="language-plaintext highlighter-rouge">main</code>, and move text processing and calculation into separate functions; this makes testing and reuse easier.</li>
  <li>Opinion: at the stage of handling small UTF-8 text files, <code class="language-plaintext highlighter-rouge">read_to_string</code> is the simplest choice. Bringing in streaming and buffer optimization too early makes it easy to lose the structural picture.</li>
</ul>

<h2 id="한계와-예외">Limits and Exceptions</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">std::env::args</code> has limitations with non-Unicode arguments, so consider <code class="language-plaintext highlighter-rouge">args_os</code> when such input matters.</li>
  <li><code class="language-plaintext highlighter-rouge">read_to_string</code> loads the entire file into memory, so it may not suit very large files or binary files.</li>
  <li><code class="language-plaintext highlighter-rouge">write</code> overwrites existing file contents, so appending requires a different API.</li>
  <li>Real CLI tools may need libraries like <code class="language-plaintext highlighter-rouge">clap</code>, the <code class="language-plaintext highlighter-rouge">BufRead</code> trait, and more specific error types, but this post covers only the simplest standard-library pattern.</li>
</ul>
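<p>To illustrate the <code class="language-plaintext highlighter-rouge">args_os</code> point above, the argument-taking step can be written so it never requires valid Unicode. The helper name <code class="language-plaintext highlighter-rouge">first_path_arg</code> is mine, not from the docs.</p>

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

// Take the first real argument (index 1) as a path without requiring valid
// Unicode; `env::args` would panic on a non-Unicode argument, `args_os` won't.
fn first_path_arg<I: Iterator<Item = OsString>>(mut args: I) -> Option<PathBuf> {
    args.nth(1).map(PathBuf::from)
}

fn main() {
    match first_path_arg(env::args_os()) {
        Some(path) => println!("got path argument: {}", path.display()),
        None => eprintln!("usage: cargo run -- <file-path>"),
    }
}
```

<p>Because the helper takes any iterator of <code class="language-plaintext highlighter-rouge">OsString</code>, it can also be tested with a fabricated argument list instead of a real process invocation.</p>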

<h2 id="참고자료">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/std/env/fn.args.html">std::env::args</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.read_to_string.html">std::fs::read_to_string</a></li>
  <li><a href="https://doc.rust-lang.org/std/fs/fn.write.html">std::fs::write</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html">Recoverable Errors with Result</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html">Accepting Command Line Arguments</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="file-io" /><category term="cli" /><category term="std-fs" /><category term="env-args" /><summary type="html"><![CDATA[A Rust guide to handling file and CLI input with std::fs::read_to_string, write, std::env::args, and Result.]]></summary></entry><entry xml:lang="en"><title type="html">Harness Engineering 09. Agent Systems Need Approval Boundaries and Guardrails</title><link href="https://www.k4nul.com/en/ai/approval-boundaries-and-guardrails/" rel="alternate" type="text/html" title="Harness Engineering 09. Agent Systems Need Approval Boundaries and Guardrails" /><published>2026-04-18T00:00:00+09:00</published><updated>2026-04-18T00:00:00+09:00</updated><id>https://www.k4nul.com/en/ai/approval-boundaries-and-guardrails-en</id><content type="html" xml:base="https://www.k4nul.com/en/ai/approval-boundaries-and-guardrails/"><![CDATA[<h2 id="summary">Summary</h2>

<p>If a system has many rules and still feels unsafe, the missing piece is often the boundary. A team may have role separation, verification rules, and good documentation principles, yet some actions still execute too easily and some inputs are trusted too quickly. That is why this post focuses on the difference between good operating documents and safe execution boundaries, and on why approval boundaries and guardrails need to be part of an agent system from the beginning.</p>

<h2 id="verification-scope-and-interpretation-boundary">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026, checked OpenAI Codex docs, OpenAI platform docs, and Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored engineering posts only when a concept is introduced there.</li>
  <li>Fact boundary: only documented features such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, evals, and trace are treated as factual claims.</li>
  <li>Interpretation boundary: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating abstractions used in this series unless a source line says otherwise.</li>
</ul>

<h2 id="what-is-an-approval-boundary">What Is an Approval Boundary?</h2>

<p>An approval boundary defines where an agent may act on its own and where human approval becomes mandatory. This is more than a permission toggle. It is a design decision about where responsibility changes hands.
Documented fact: OpenAI documents agent approvals as a distinct security layer, and Anthropic documents permissions as the layer that defines what actions are allowed(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)</p>

<p>The easiest examples are destructive actions. Deletion, deployment, and large-scale modification all have consequences that are hard to undo or broad in impact. A change that overwrites hundreds of files, deploys to production, or deletes existing data should usually require human approval before it proceeds. Writing “be careful” in documentation is not the same thing as making the system stop until a person approves the action.</p>
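<p>As a toy illustration of that difference (the <code class="language-plaintext highlighter-rouge">ActionKind</code> type and its variants are my own invention, not a vendor API), an approval boundary that actually stops is an explicit check in the execution path rather than a sentence in a document:</p>

```rust
// Hypothetical action classification for a harness; not an official schema.
#[derive(Debug)]
enum ActionKind {
    ReadFile,
    EditFile,
    Delete,
    Deploy,
}

// The approval boundary as code: destructive actions must escalate to a human.
fn requires_human_approval(action: &ActionKind) -> bool {
    matches!(action, ActionKind::Delete | ActionKind::Deploy)
}

fn main() {
    let actions = [
        ActionKind::ReadFile,
        ActionKind::EditFile,
        ActionKind::Delete,
        ActionKind::Deploy,
    ];
    for action in &actions {
        if requires_human_approval(action) {
            println!("{action:?}: blocked until a person approves");
        } else {
            println!("{action:?}: agent may proceed");
        }
    }
}
```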

<h2 id="what-do-guardrails-prevent">What Do Guardrails Prevent?</h2>

<p>What I mean here by guardrails is a broad practical label rather than a perfectly standardized vendor term. Official documentation often breaks this down into more specific mechanisms such as permissions, hooks, input or output guardrails, and trust verification. But the shared idea is still the same: mechanisms that stop an agent before it goes too far in the wrong direction or that limit risky flows. If approval boundaries define where escalation is required, guardrails define what kinds of actions or flows are not freely allowed in the first place.
Documented fact: OpenAI explains guardrails, policy checks, and execution limits in its safety and sandboxing docs, while Anthropic explains control points through hooks, sandboxing, and security guidance(<a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://developers.openai.com/codex/concepts/sandboxing">OpenAI sandboxing</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>, <a href="https://code.claude.com/docs/en/security">Claude security</a>)
Interpretation: in this post, <code class="language-plaintext highlighter-rouge">guardrail</code> is a practical umbrella term that groups those mechanisms together.</p>

<p>One clear example is preventing untrusted input from being treated as an executable instruction. External documents, issue bodies, web content, and data arriving through connectors or MCP can all be useful context. But they should not directly push agent behavior as if they were system instructions. A sentence found in an outside document should not immediately lead to deletion, deployment, or privilege expansion. External text can provide context, but it should not automatically inherit execution authority.</p>

<h2 id="where-does-this-matter-most">Where Does This Matter Most?</h2>

<p>Approval and guardrails matter especially in three areas.
Interpretation: the priority ordering in this section is my own operating judgment. The official docs still support the general direction by treating risky tool use, outside input, permission boundaries, and sandboxing as separate safety concerns(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>)</p>

<p>First, destructive action. Deletion, forceful overwrite, bulk editing, and deployment need clear approval boundaries. Second, external data and untrusted input. This is exactly why outside text should not be treated as direct execution guidance. It may be a factual claim or a suggestion, but it is not authority. Third, connector and MCP usage. In those cases, the system should at least record what was read, which tool was called, and how the result influenced the final decision.</p>

<p>For connector or MCP usage, a minimal log often helps: <code class="language-plaintext highlighter-rouge">source</code>, <code class="language-plaintext highlighter-rouge">tool_name</code>, <code class="language-plaintext highlighter-rouge">query_or_resource</code>, <code class="language-plaintext highlighter-rouge">timestamp</code>, and <code class="language-plaintext highlighter-rouge">decision_usage</code>. This is not a formal standard schema so much as a practical example of the minimum data that helps trace why external input led to a particular action. The important point is not merely that an external connector was used, but that its impact on the decision path remains visible.</p>
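<p>A minimal sketch of that record as a struct follows. The field names match the list above; the single-line formatting is my own choice, not a standard schema.</p>

```rust
// Hypothetical minimal connector/MCP usage record; not an official schema.
struct ConnectorLog {
    source: String,            // e.g. which connector or MCP server
    tool_name: String,         // which tool was called
    query_or_resource: String, // what was read or queried
    timestamp: String,         // when it happened (ISO 8601 string for simplicity)
    decision_usage: String,    // how the result influenced the final decision
}

impl ConnectorLog {
    // One line per event keeps the decision path easy to grep later.
    fn to_line(&self) -> String {
        format!(
            "{} source={} tool={} resource={} used_for={}",
            self.timestamp, self.source, self.tool_name, self.query_or_resource, self.decision_usage
        )
    }
}

fn main() {
    let entry = ConnectorLog {
        source: "mcp:docs".into(),
        tool_name: "search".into(),
        query_or_resource: "release notes".into(),
        timestamp: "2026-04-15T10:00:00+09:00".into(),
        decision_usage: "context-only".into(),
    };
    println!("{}", entry.to_line());
}
```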

<h2 id="what-i-came-to-see-as-more-important-later">What I Came to See as More Important, Later</h2>

<p>Early on, I spent more attention on role separation, verification, and documentation rules. Those questions were easier to see first: who should do what, which tests should exist, and which documents should be updated. Approval and guardrails felt more like optional additions that could be refined later.
Interpretation: this section describes how I came to rank approval boundaries and guardrails more highly. The factual anchor is that approvals, permissions, hooks, and sandboxing are all documented as separate operating layers(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/codex/hooks">OpenAI hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>)</p>

<p>Over time, it became much clearer that they were not optional at all. Even with good rules, risky actions can happen too easily if the approval boundary is empty. If guardrails are weak, outside input can end up carrying more weight than intended. A good operating document and a safe execution boundary are not the same thing.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>Any agent-first operating document needs approval boundaries and guardrails alongside it. You have to define which tool calls require human approval, where destructive actions must stop, how untrusted text and outside input should be handled, and what needs to be recorded when connectors or MCP are used. In harness engineering, the goal is not just to write many rules, but to make sure those rules operate inside safe execution boundaries.</p>

<p>The next post will turn this into a practical transition roadmap. The question there is how to move from document-centered operations toward an observable harness in a realistic sequence.</p>

<h2 id="one-sentence-summary">One-Sentence Summary</h2>

<p>Good operating rules alone do not make an agent system safe; approval boundaries and guardrails must be designed as part of the execution boundary from the start.</p>

<h2 id="preview-of-the-next-post">Preview of the Next Post</h2>

<p>The next post will look at a practical roadmap for moving from document-centered operations to an observable harness. The problems in this series are not isolated issues; in practice they form a transition path that has to be handled step by step. So the next piece will focus on sequencing: what should move out of documents first, what should be lifted into schema and eval, and what should be strengthened through trace and guardrails. The goal is not to create a perfect harness all at once, but to move toward one realistically. That is where the series shifts from concept explanation into practical adoption order.</p>

<h2 id="sources-and-references">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/concepts/sandboxing">Sandboxing</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/sandboxing">Sandboxing</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/security">Security</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="guardrail" /><category term="approval" /><category term="mcp" /><summary type="html"><![CDATA[Explains why agent systems become unstable when approval boundaries and guardrails are missing.]]></summary></entry><entry xml:lang="ko"><title type="html">Harness Engineering 09. An Agent System Is Unstable When approval and guardrail Are Empty</title><link href="https://www.k4nul.com/ai/approval-boundaries-and-guardrails/" rel="alternate" type="text/html" title="Harness Engineering 09. An Agent System Is Unstable When approval and guardrail Are Empty" /><published>2026-04-18T00:00:00+09:00</published><updated>2026-04-18T00:00:00+09:00</updated><id>https://www.k4nul.com/ai/approval-boundaries-and-guardrails</id><content type="html" xml:base="https://www.k4nul.com/ai/approval-boundaries-and-guardrails/"><![CDATA[<h2 id="요약">Summary</h2>

<p>If a system has many rules and still feels uneasy, the boundary is probably empty. A team may have role separation, verification rules, and well-written documentation principles, yet in actual operation some actions execute too easily and some inputs are trusted too quickly. So this post looks at why a good operating document and a safe execution boundary are different problems, and why an agent system needed approval boundaries and guardrails from the start.</p>

<h2 id="검증-기준과-해석-경계">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: 2026-04-15, checked the OpenAI Codex docs, the official OpenAI API docs, and the Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, with vendor-authored posts that introduced a concept used only as support.</li>
  <li>Fact boundary: only features verifiable in the docs, such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, eval, and trace, are stated as facts.</li>
  <li>Interpretation boundary: expressions like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating concepts used in this series, and are the author's interpretation where no separate evidence line is given.</li>
</ul>

<h2 id="approval-boundary란-무엇인가">What Is an approval boundary?</h2>

<p>An approval boundary is the line between actions the agent may take on its own and actions that require human approval. It is less a simple permission setting than a design decision that makes clear where responsibility changes hands.
Documented fact: OpenAI describes agent approvals as a distinct security layer, and Anthropic documents through permissions which actions are allowed(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>)</p>

<p>The easiest example is destructive action. Work like deletion, deployment, and bulk modification is hard to undo or broad in impact. For instance, a job that changes hundreds of files at once, a production deployment, or a job that erases existing data is safer behind human approval than automatic execution, even when the agent understands the rules well. Writing “be careful” in a good document and actually making the system unable to proceed without approval are entirely different layers.</p>

<h2 id="guardrail은-무엇을-막는가">What Does a guardrail Prevent?</h2>

<p>The guardrail discussed here is closer to a broad practical label. Official docs sometimes describe it as more finely divided mechanisms such as permissions, hooks, input/output guardrails, and trust verification. What they share, though, is the role of stopping an agent before it goes too far in the wrong direction, or of limiting risky flows. If the approval boundary sets the approval point, the guardrail restricts actions and risky flows that are not allowed at all.
Documented fact: OpenAI explains guardrails, policy checks, and execution limits in its safety guide and sandboxing docs, and Anthropic explains pre- and post-execution control points in its hooks, sandboxing, and security docs(<a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://developers.openai.com/codex/concepts/sandboxing">OpenAI sandboxing</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>, <a href="https://code.claude.com/docs/en/security">Claude security</a>)
Interpretation: <code class="language-plaintext highlighter-rouge">guardrail</code> in this post is a practical umbrella term for these mechanisms.</p>

<p>For example, a guardrail is what keeps untrusted input from being treated directly as an execution instruction. External documents, issue bodies, text fetched from the web, and data arriving through connectors or MCP can all be useful context. But it is dangerous to let that content push agent behavior as if it were a system instruction. Something written in an external document should not lead straight to deletion, deployment, or privilege expansion. External text can be reference context, but it should not become input that immediately gains execution authority.</p>

<h2 id="어떤-영역에서-특히-중요해지는가">Where Does This Matter Most?</h2>

<p>Approval and guardrails become important in three areas in particular.
Interpretation: the judgment that certain areas matter most is the author's operating standard. That said, the official docs also treat high-risk tool use, outside input, permission boundaries, and sandboxing as separate safety topics(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/sandboxing">Claude sandboxing</a>)</p>

<p>First, destructive action. Deletion, forceful overwrite, bulk change, and deployment need an approval boundary. Second, external data and untrusted input. This is why the contents of an external document must not be treated as execution instructions as-is. Text can be a factual claim, but it is not a delegation of authority. Third, external connections such as connectors and MCP. In that case, at minimum the system should record what the agent read, which tool it called, and which decision the result was used for.</p>

<p>For connector/MCP usage, for example, it helps to record fields like <code class="language-plaintext highlighter-rouge">source</code>, <code class="language-plaintext highlighter-rouge">tool_name</code>, <code class="language-plaintext highlighter-rouge">query_or_resource</code>, <code class="language-plaintext highlighter-rouge">timestamp</code>, and <code class="language-plaintext highlighter-rouge">decision_usage</code>. This is less an official standard schema than a minimal practical example for later tracing why a piece of external data led to a given action. What matters more than the fact that an external connection was used is that its influence on the decision process stays visible.</p>

<h2 id="내가-뒤늦게-더-중요하게-보게-된-항목">What I Came to Weight More Heavily, Belatedly</h2>

<p>When I first designed agent operations, I too focused more on role separation, verification, and documentation rules. What to delegate and how, which tests to attach, which documents to update: those were easier to see. Approvals and guardrails looked like options to bolt on afterward.
Interpretation: this section records how I came to treat approval boundaries and guardrails as a more central axis. The documented fact is that approvals, permissions, hooks, and sandboxing all exist as separate operational layers(<a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a>, <a href="https://developers.openai.com/codex/hooks">OpenAI hooks</a>, <a href="https://code.claude.com/docs/en/permissions">Claude permissions</a>, <a href="https://code.claude.com/docs/en/hooks">Claude hooks</a>)</p>

<p>Looking back, though, this was not decoration to bolt on later; it was something that has to be designed in from the start. Even with good rules, an empty approval boundary lets dangerous actions execute far too easily, and weak guardrails give external input more power than expected. A good operations document and a safe execution boundary turned out not to be the same thing.</p>

<h2 id="마무리">Closing</h2>

<p>An agent-first operations document needs approval boundaries and guardrails. From the start, you have to decide which tool calls require human approval, where destructive actions must stop, how external text and untrusted input are handled, and what must be recorded when connectors/MCP are used. What matters in harness engineering is not writing lots of rules, but making those rules operate inside a safe execution boundary.</p>

<p>The next post carries this thread into an actual migration roadmap: what to change first, in practical order, when moving from document-centered operations to an observable harness.</p>

<h2 id="이-글의-한-줄-요약">One-Line Summary of This Post</h2>

<p>Good operating rules alone cannot make an agent system safe; execution boundaries such as approval boundaries and guardrails must be designed in from the start.</p>

<h2 id="다음-글-예고">Next Post Preview</h2>

<p>The next post will cover “From document-centered operations to an observable harness: a practical migration roadmap.” The problems this series has examined are not isolated from one another; in practice they form a migration task that has to be worked through a step at a time. So the next post will set the order: what to separate out of documents first, what to lift into schemas and evals, and what to reinforce with traces and guardrails. Rather than building a perfect harness up front, what matters more is the realistic sequence of stages. From here on, we move past concept explanation into the order of actual adoption.</p>

<h2 id="출처-및-참고">Sources and References</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/concepts/sandboxing">Sandboxing</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agent-builder-safety">Safety in building agents</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/permissions">Permissions</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/hooks">Hooks reference</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/sandboxing">Sandboxing</a></li>
  <li>Anthropic, <a href="https://code.claude.com/docs/en/security">Security</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="harness-engineering" /><category term="guardrail" /><category term="approval" /><category term="mcp" /><summary type="html"><![CDATA[Why an agent system wobbles when the approval boundary and guardrails are left empty.]]></summary></entry><entry xml:lang="en"><title type="html">Rust 10. Testing in Rust</title><link href="https://www.k4nul.com/en/rust/testing-in-rust/" rel="alternate" type="text/html" title="Rust 10. Testing in Rust" /><published>2026-04-17T09:00:00+09:00</published><updated>2026-04-17T09:00:00+09:00</updated><id>https://www.k4nul.com/en/rust/rust-testing-basics-en</id><content type="html" xml:base="https://www.k4nul.com/en/rust/testing-in-rust/"><![CDATA[<h2 id="summary">Summary</h2>

<p>In Rust, getting code to run once is not the same as being able to trust future changes. As soon as functions grow and files start splitting apart, <code class="language-plaintext highlighter-rouge">cargo test</code> becomes one of the most practical tools for staying confident during refactoring.</p>

<p>This post introduces <code class="language-plaintext highlighter-rouge">cargo test</code>, <code class="language-plaintext highlighter-rouge">#[test]</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code>, integration tests, and tests that return <code class="language-plaintext highlighter-rouge">Result</code>. The practical conclusion is that Rust becomes much easier to test when input and output stay thin, core logic is kept in small functions, and reusable logic lives in <code class="language-plaintext highlighter-rouge">lib.rs</code>.</p>

<h2 id="document-information">Document Information</h2>

<ul>
  <li>Written on: 2026-04-15</li>
  <li>Verification date: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Test version: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: only official documentation is used.</li>
  <li>Note: this post focuses on the beginner testing workflow and intentionally leaves broader topics such as async testing and property-based testing for separate material.</li>
</ul>

<h2 id="problem-definition">Problem Definition</h2>

<p>At the beginner stage, it is easy to think that a successful run means the program is done. But once code starts changing, several problems appear immediately.</p>

<ul>
  <li>Refactoring becomes risky because you have to check behavior manually.</li>
  <li>Edge cases need to be rerun by hand over and over.</li>
  <li>If all logic stays in <code class="language-plaintext highlighter-rouge">main.rs</code>, it is hard to verify just one behavior in isolation.</li>
</ul>

<p>The goal of this post is to treat testing not as a heavy process, but as the default way to keep small Rust programs safe to change.</p>

<h2 id="verified-facts">Verified Facts</h2>

<ul>
  <li>According to the official Cargo docs, <code class="language-plaintext highlighter-rouge">cargo test</code> builds and runs the tests for a project.
Evidence: <a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li>According to the official Rust Book, the <code class="language-plaintext highlighter-rouge">#[test]</code> attribute marks a test function, and macros like <code class="language-plaintext highlighter-rouge">assert!</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, and <code class="language-plaintext highlighter-rouge">assert_ne!</code> are used to verify expectations.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li>According to the official docs, unit tests are commonly placed in a <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code> module in the same file, while integration tests go in the <code class="language-plaintext highlighter-rouge">tests/</code> directory.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
  <li>According to the official docs, tests can also be written to return <code class="language-plaintext highlighter-rouge">Result&lt;(), E&gt;</code>.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
</ul>

<p>One of the most practical beginner layouts is this:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rust-testing-basics/
  Cargo.toml
  src/
    lib.rs
  tests/
    normalize.rs
</code></pre></div></div>

<p>Put the function you want to verify in <code class="language-plaintext highlighter-rouge">src/lib.rs</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="n">input</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
    <span class="n">input</span><span class="nf">.trim</span><span class="p">()</span><span class="nf">.to_lowercase</span><span class="p">()</span>
<span class="p">}</span>

<span class="nd">#[cfg(test)]</span>
<span class="k">mod</span> <span class="n">tests</span> <span class="p">{</span>
    <span class="k">use</span> <span class="k">super</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">trims_and_lowercases</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Alice "</span><span class="p">),</span> <span class="s">"alice"</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">keeps_empty_string_after_trim</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"   "</span><span class="p">),</span> <span class="s">""</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">result_based_test</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">String</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">actual</span> <span class="o">=</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="s">"Rust"</span><span class="p">);</span>

        <span class="k">if</span> <span class="n">actual</span> <span class="o">==</span> <span class="s">"rust"</span> <span class="p">{</span>
            <span class="nf">Ok</span><span class="p">(())</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="nf">Err</span><span class="p">(</span><span class="nd">format!</span><span class="p">(</span><span class="s">"unexpected value: {}"</span><span class="p">,</span> <span class="n">actual</span><span class="p">))</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then place an integration test in <code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> to verify the public API from the outside.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">rust_testing_basics</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

<span class="nd">#[test]</span>
<span class="k">fn</span> <span class="nf">integration_test_uses_public_api</span><span class="p">()</span> <span class="p">{</span>
    <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Bob  "</span><span class="p">),</span> <span class="s">"bob"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>One practical detail is that the package name <code class="language-plaintext highlighter-rouge">rust-testing-basics</code> is accessed in code as <code class="language-plaintext highlighter-rouge">rust_testing_basics</code>. This structure also prepares you well for later posts, because once file I/O or CLI logic is added, you can still keep the core logic testable inside <code class="language-plaintext highlighter-rouge">lib.rs</code>.</p>

<h2 id="directly-confirmed-results">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions available in the current writing environment were:</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: when I ran <code class="language-plaintext highlighter-rouge">cargo test --quiet</code> against the same <code class="language-plaintext highlighter-rouge">src/lib.rs</code> and <code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> structure as the post, the key output was:</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">test</span><span class="w"> </span><span class="nt">--quiet</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>running 3 tests
...
test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 1 test
.
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
</code></pre></div></div>

<ul>
  <li>Limitation of direct reproduction: I reproduced the test run in a temporary Cargo project, but I did not add a separate example test project to this repository.</li>
</ul>

<h2 id="interpretation--opinion">Interpretation / Opinion</h2>

<ul>
  <li>My view is that the most important beginner testing skill is not test syntax, but function boundaries.</li>
  <li>Opinion: if <code class="language-plaintext highlighter-rouge">main.rs</code> stays thin and the core transformation logic moves into <code class="language-plaintext highlighter-rouge">lib.rs</code>, testing becomes much easier.</li>
  <li>Opinion: unit tests are good for small rules, while integration tests are good for checking whether the public API is wired correctly. Beginners benefit from seeing both at least once.</li>
</ul>

<h2 id="limits-and-exceptions">Limits and Exceptions</h2>

<ul>
  <li>This post covers only the testing basics and does not include benchmarking, property-based testing, or snapshot testing.</li>
  <li>Async tests, database tests, and network tests may require extra runtime or environment setup, so they are outside the scope here.</li>
  <li>Options for controlling parallel execution, filtering tests, or handling output are documented in more detail in the <code class="language-plaintext highlighter-rouge">cargo test</code> docs.</li>
  <li>The exact way you design failure messages and fixtures can vary by team or project.</li>
</ul>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-00-testing.html">Writing Automated Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-02-running-tests.html">Controlling How Tests Are Run</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="testing" /><category term="cargo-test" /><category term="unit-test" /><category term="integration-test" /><summary type="html"><![CDATA[Beginner-friendly Rust testing guide to cargo test, unit tests, integration tests, assert_eq!, and Result-based tests.]]></summary></entry><entry xml:lang="ko"><title type="html">Rust 10. Testing in Rust</title><link href="https://www.k4nul.com/rust/rust-testing-basics/" rel="alternate" type="text/html" title="Rust 10. Testing in Rust" /><published>2026-04-17T09:00:00+09:00</published><updated>2026-04-17T09:00:00+09:00</updated><id>https://www.k4nul.com/rust/rust-testing-basics</id><content type="html" xml:base="https://www.k4nul.com/rust/rust-testing-basics/"><![CDATA[<h2 id="요약">Summary</h2>

<p>In Rust, running code and verifying code are different steps. The fact that an example ran once does not mean later edits will keep producing the same result, so it is much safer to pick up <code class="language-plaintext highlighter-rouge">cargo test</code> and the basic test structure early.</p>

<p>This post walks through <code class="language-plaintext highlighter-rouge">cargo test</code>, <code class="language-plaintext highlighter-rouge">#[test]</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code>, integration tests, and tests that return <code class="language-plaintext highlighter-rouge">Result</code>, at a beginner level. The short conclusion: to write Rust code that is easy to test, first adopt the principle of “thin I/O, core logic in small functions, reusable functions in <code class="language-plaintext highlighter-rouge">lib.rs</code>.”</p>

<h2 id="문서-정보">Document Information</h2>

<ul>
  <li>Written on: 2026-04-15</li>
  <li>Verification date: 2026-04-16</li>
  <li>Document type: tutorial</li>
  <li>Test environment: Windows 11 Pro, Windows PowerShell, Cargo CLI examples</li>
  <li>Test versions: rustc 1.94.0, cargo 1.94.0</li>
  <li>Source grade: only official documentation is used.</li>
  <li>Note: this post focuses on the beginner testing flow and leaves extended topics such as async testing and property-based testing to separate material.</li>
</ul>

<h2 id="문제-정의">Problem Definition</h2>

<p>At the beginner stage, it is easy to feel that the program is “done” once it runs. But as functions multiply and files start splitting apart, the following problems appear immediately.</p>

<ul>
  <li>After a refactor, you can only check by eye whether old behavior broke.</li>
  <li>Edge cases have to be rerun manually, again and again.</li>
  <li>With all the logic crowded into <code class="language-plaintext highlighter-rouge">main.rs</code>, it is hard to verify one behavior in isolation.</li>
</ul>

<p>The goal of this post is to see testing not as a grand quality process, but as “the basic tool for continuing to change small Rust functions safely.”</p>

<h2 id="확인된-사실">Verified Facts</h2>

<ul>
  <li>According to the official Cargo docs, <code class="language-plaintext highlighter-rouge">cargo test</code> builds and runs a project’s tests.
Evidence: <a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li>According to the official docs, the <code class="language-plaintext highlighter-rouge">#[test]</code> attribute marks a function as a test, and macros such as <code class="language-plaintext highlighter-rouge">assert!</code>, <code class="language-plaintext highlighter-rouge">assert_eq!</code>, and <code class="language-plaintext highlighter-rouge">assert_ne!</code> verify expected results.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li>According to the official docs, unit tests usually live in a <code class="language-plaintext highlighter-rouge">#[cfg(test)]</code> module in the same file, while integration tests go in the <code class="language-plaintext highlighter-rouge">tests/</code> directory.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
  <li>According to the official docs, a test function can also be written to return <code class="language-plaintext highlighter-rouge">Result&lt;(), E&gt;</code>.
Evidence: <a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
</ul>

<p>For beginners, the most practical structure looks like this.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rust-testing-basics/
  Cargo.toml
  src/
    lib.rs
  tests/
    normalize.rs
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">src/lib.rs</code> holds the pure function you want to verify.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="n">input</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">str</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
    <span class="n">input</span><span class="nf">.trim</span><span class="p">()</span><span class="nf">.to_lowercase</span><span class="p">()</span>
<span class="p">}</span>

<span class="nd">#[cfg(test)]</span>
<span class="k">mod</span> <span class="n">tests</span> <span class="p">{</span>
    <span class="k">use</span> <span class="k">super</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">trims_and_lowercases</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Alice "</span><span class="p">),</span> <span class="s">"alice"</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">keeps_empty_string_after_trim</span><span class="p">()</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"   "</span><span class="p">),</span> <span class="s">""</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="nd">#[test]</span>
    <span class="k">fn</span> <span class="nf">result_based_test</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nb">String</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">actual</span> <span class="o">=</span> <span class="nf">normalize_name</span><span class="p">(</span><span class="s">"Rust"</span><span class="p">);</span>

        <span class="k">if</span> <span class="n">actual</span> <span class="o">==</span> <span class="s">"rust"</span> <span class="p">{</span>
            <span class="nf">Ok</span><span class="p">(())</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="nf">Err</span><span class="p">(</span><span class="nd">format!</span><span class="p">(</span><span class="s">"unexpected value: {}"</span><span class="p">,</span> <span class="n">actual</span><span class="p">))</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> can hold an integration test written from an external user’s point of view.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">rust_testing_basics</span><span class="p">::</span><span class="n">normalize_name</span><span class="p">;</span>

<span class="nd">#[test]</span>
<span class="k">fn</span> <span class="nf">integration_test_uses_public_api</span><span class="p">()</span> <span class="p">{</span>
    <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">normalize_name</span><span class="p">(</span><span class="s">"  Bob  "</span><span class="p">),</span> <span class="s">"bob"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here the package name <code class="language-plaintext highlighter-rouge">rust-testing-basics</code> is accessed in code through the underscore path <code class="language-plaintext highlighter-rouge">rust_testing_basics</code>. This structure has the advantage that even when file I/O or a CLI is added later, the core logic can stay in <code class="language-plaintext highlighter-rouge">lib.rs</code> and remain easy to test.</p>

<h2 id="직접-확인한-결과">Directly Confirmed Results</h2>

<ul>
  <li>Directly confirmed result: the Rust toolchain versions in the current writing environment were as follows.</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rustc</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span><span class="n">cargo</span><span class="w"> </span><span class="nt">--version</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rustc 1.94.0 (4a4ef493e 2026-03-02)
cargo 1.94.0 (85eff7c80 2026-01-15)
</code></pre></div></div>

<ul>
  <li>Directly confirmed result: running <code class="language-plaintext highlighter-rouge">cargo test --quiet</code> with the same <code class="language-plaintext highlighter-rouge">src/lib.rs</code> and <code class="language-plaintext highlighter-rouge">tests/normalize.rs</code> structure as in the post produced the following key output.</li>
</ul>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cargo</span><span class="w"> </span><span class="nx">test</span><span class="w"> </span><span class="nt">--quiet</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>Observed output:</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>running 3 tests
...
test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 1 test
.
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
</code></pre></div></div>

<ul>
  <li>Limitation of direct confirmation: I built a temporary test project from the structure in this post and ran <code class="language-plaintext highlighter-rouge">cargo test</code>, but did not add a separate example test project to this repository.</li>
</ul>

<h2 id="해석--의견">Interpretation / Opinion</h2>

<ul>
  <li>In my judgment, the most important part of getting started with Rust testing is not test syntax but function boundaries.</li>
  <li>Opinion: keeping only I/O and branching in <code class="language-plaintext highlighter-rouge">main.rs</code> and moving core computation and transformation logic into <code class="language-plaintext highlighter-rouge">lib.rs</code> lowers the difficulty of testing considerably.</li>
  <li>Opinion: unit tests catch small rules quickly, while integration tests suit checking that the public API is wired as expected. At the beginner stage it is worth experiencing both at least once.</li>
</ul>

<h2 id="한계와-예외">Limits and Exceptions</h2>

<ul>
  <li>This post covers only the basic testing flow and does not include benchmarks, property-based testing, or snapshot testing.</li>
  <li>Testing async functions, databases, and networks may require an additional runtime or external environment, so they are excluded from this scope.</li>
  <li>Options for controlling parallel execution, capturing output, and running only specific tests are covered in more detail in the <code class="language-plaintext highlighter-rouge">cargo test</code> docs.</li>
  <li>How failure messages and fixtures are designed can vary by team convention.</li>
</ul>

<h2 id="참고자료">References</h2>

<ul>
  <li><a href="https://doc.rust-lang.org/cargo/commands/cargo-test.html">cargo-test(1)</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-00-testing.html">Writing Automated Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-01-writing-tests.html">How to Write Tests</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-02-running-tests.html">Controlling How Tests Are Run</a></li>
  <li><a href="https://doc.rust-lang.org/book/ch11-03-test-organization.html">Test Organization</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="Rust" /><category term="rust" /><category term="testing" /><category term="cargo-test" /><category term="unit-test" /><category term="integration-test" /><summary type="html"><![CDATA[A beginner’s guide to Rust testing covering cargo test, unit tests, integration tests, assert_eq!, and Result-returning tests.]]></summary></entry><entry xml:lang="en"><title type="html">Harness Engineering 08. Results Alone Are Not Enough; You Need Trace</title><link href="https://www.k4nul.com/en/ai/why-trace-matters-more-than-the-result/" rel="alternate" type="text/html" title="Harness Engineering 08. Results Alone Are Not Enough; You Need Trace" /><published>2026-04-17T00:00:00+09:00</published><updated>2026-04-17T00:00:00+09:00</updated><id>https://www.k4nul.com/en/ai/why-trace-matters-more-than-results-en</id><content type="html" xml:base="https://www.k4nul.com/en/ai/why-trace-matters-more-than-the-result/"><![CDATA[<h2 id="summary">Summary</h2>

<p>The result is preserved, but no one knows why the system acted that way. In harness engineering, that situation becomes a real problem surprisingly often. The code output is there and the execution log exists, but the reasoning is missing: why this team was chosen as the owner, why this tool was selected, why the task was delegated to a sub-agent, or where a handoff actually failed. That is why this post focuses on the difference between result logs and the decision-relevant layer inside a trace, and why trace is what makes real improvement possible.</p>

<h2 id="verification-scope-and-interpretation-boundary">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026, checked OpenAI Codex docs, OpenAI platform docs, and Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored engineering posts only when a concept is introduced there.</li>
  <li>Fact boundary: only documented features such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, evals, and trace are treated as factual claims.</li>
  <li>Interpretation boundary: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operating abstractions used in this series unless a source line says otherwise.</li>
</ul>

<h2 id="result-logs-and-the-decision-relevant-layer-of-trace">Result Logs and the Decision-Relevant Layer of Trace</h2>

<p>Result logs usually record what happened. They show whether tests passed, which files changed, which commands were run, and what the final artifact looked like. That is useful and necessary.
Documented fact: OpenAI’s Agents SDK guide explicitly talks about keeping a “full trace of what happened,” and the trace grading guide treats traces as evaluation inputs(<a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>)</p>

<p>What I mean here by the decision-relevant layer of trace is not a formal vendor term, but a descriptive way to point at the parts of a trace that preserve decision context: why the platform team was chosen instead of the auth team, why a certain analysis tool was used instead of a simpler one, why a sub-agent was introduced, or why a handoff happened at that specific point. Storing the result and preserving that decision context may look similar, but they serve different purposes.</p>

<h2 id="what-goes-wrong-when-trace-is-missing">What Goes Wrong When Trace Is Missing</h2>

<p>When trace is weak, teams are forced to guess from the output alone. Imagine a case where the final code exists, but the work was routed to the wrong owner. Looking at the result may tell you who ended up touching the code, but not why that routing decision happened. That means the next fix often becomes guesswork rather than a real improvement to the harness.
Interpretation: this section explains the operational problem of reconstructing failures from outputs alone. The factual anchor is that trace is meant to carry more than the final artifact, including intermediate decisions and tool use(<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>)</p>

<p>Tool choice works the same way. An execution log may show which tool was used, but not why it was chosen. Without that reason, the same poor choice can repeat while the harness team still lacks a clear place to intervene. Handoff failure is similar. The result may show that information was missing, but only trace can show which step failed to pass it along.</p>

<h2 id="what-kinds-of-reasoning-should-be-preserved">What Kinds of Reasoning Should Be Preserved?</h2>

<p>This does not mean recording every thought in exhaustive detail. It means preserving the minimum reasoning needed for improvement. At a minimum, it helps to capture why an owner was selected, why a tool was chosen, whether a sub-agent was used and why, when the handoff happened, and whether scope was narrowed or expanded.
Interpretation: the specific minimum fields listed here are my recommendation. The official direction behind them is that trace grading evaluates reasoning process and tool path, not just the final answer (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/subagents">Subagents</a>).</p>
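<p>As one way to make these minimum fields concrete, here is a sketch of a decision-trace record. The <code class="language-plaintext highlighter-rouge">DecisionRecord</code> structure and its field names are my own assumptions, not a vendor schema:</p>

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class DecisionRecord:
    """One decision-relevant trace entry (hypothetical schema)."""
    step: str                              # e.g. "routing", "tool-choice", "handoff"
    owner: Optional[str] = None            # who the work was routed to
    owner_reason: Optional[str] = None     # why that owner, not another
    tool: Optional[str] = None             # which tool was invoked
    tool_reason: Optional[str] = None      # why this tool over a simpler one
    subagent_reason: Optional[str] = None  # why a sub-agent was introduced, if any
    scope_change: Optional[str] = None     # "narrowed" / "expanded", plus why

    def to_json(self) -> str:
        # Stored alongside execution logs, one JSON line per decision.
        return json.dumps(asdict(self), ensure_ascii=False)

record = DecisionRecord(
    step="routing",
    owner="platform",
    owner_reason="change spans infra config, not auth logic",
)
print(record.to_json())
```

<p>The point is not this exact shape, but that every record carries a reason field next to the decision it explains.</p>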

<p>With that kind of trace, teams can later see decisions like “this change touches both <code class="language-plaintext highlighter-rouge">payments/</code> and <code class="language-plaintext highlighter-rouge">docs/payments.md</code>, so the payments owner should handle it” or “a sub-agent was used not for parallelism, but because a distinct domain context was required.” At that point, trace stops being a debugging note and starts becoming harness-improvement data.</p>

<h2 id="how-trace-connects-to-eval">How Trace Connects to Eval</h2>

<p>Trace is the raw material for eval. If a wrong-owner routing eval fails, trace helps distinguish whether the issue came from a bad ownership decision, stale ownership mapping, or a missing routing rule. Tool-choice eval works the same way. Looking only at the output makes it hard to tell the difference between a lucky success and a well-justified success. Trace makes that difference visible.
Documented fact: OpenAI officially documents trace grading as an evaluation technique (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>).</p>
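<p>To make the lucky-versus-justified distinction concrete, here is a minimal grader sketch over a list of hypothetical decision records. The record keys and the grading rule are my assumptions, not the official trace-grading API:</p>

```python
def grade_routing(trace: list[dict]) -> dict:
    """Pass only if the routing decision carries a recorded reason.

    `trace` is a list of decision records; the "step", "owner", and
    "owner_reason" keys are assumed field names, not a vendor schema.
    """
    routing = [r for r in trace if r.get("step") == "routing"]
    if not routing:
        return {"passed": False, "why": "no routing decision recorded"}
    decision = routing[0]
    if not decision.get("owner_reason"):
        # The final output may still be correct, but without a reason
        # we cannot tell a lucky pick from a well-justified one.
        return {"passed": False, "why": "owner chosen without a reason"}
    return {"passed": True,
            "why": f"owner={decision['owner']}: {decision['owner_reason']}"}

good = [{"step": "routing", "owner": "payments",
         "owner_reason": "touches payments/ and docs/payments.md"}]
lucky = [{"step": "routing", "owner": "payments"}]
print(grade_routing(good)["passed"], grade_routing(lucky)["passed"])  # True False
```

<p>Both runs end with the same owner, so an output-only check passes both; only the trace-based check separates them.</p>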

<p>Imagine a task that ended successfully, but the trace shows that a sub-agent was unnecessary, handoff leaked information twice, and documentation updates were delayed. The product result may still be acceptable, but the orchestration quality was weak. Trace is what reveals that gap.</p>

<h2 id="the-lack-of-trace-i-felt-directly">The Lack of Trace I Felt Directly</h2>

<p>For a while, I also assumed that execution results and logs would be enough. If I knew what ran and what the final result was, it seemed like I should be able to reconstruct what mattered. Later, it became clear that the more important question was whether the reasoning behind the decisions had been preserved at all. Without that, harness improvement kept collapsing back into guesswork from the outcome.</p>

<p>The real lesson was not just that trace is convenient. It was that trace is what makes structural improvement possible.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>Storing outputs and execution logs still matters. But that alone is not enough to understand an agent system well. Harness engineering includes not only the result, but also the observability of the reasoning process. That is why trace becomes a core asset for debugging, evaluation, and improvement.</p>

<p>The next post will connect this to approval and guardrails. Once you can observe the decision path, the next question is which kinds of decisions should be constrained or escalated before they happen.</p>

<h2 id="one-sentence-summary">One-Sentence Summary</h2>

<p>Outputs and execution logs are not enough to improve a harness; you also need trace that preserves why the system made the choices it made.</p>

<h2 id="preview-of-the-next-post">Preview of the Next Post</h2>

<p>The next post will explore why agent systems become unstable when approval and guardrails are missing. Some decisions are not enough to observe after the fact; they need to be blocked, limited, or escalated up front. That is especially true for risky tool use, edits outside intended boundaries, and actions that require stronger permissions. So the next step is to look at how approval and guardrails contribute to stability inside the harness. If trace creates observability, guardrails define the allowed space in which those decisions can happen.</p>

<h2 id="sources-and-references">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/subagents">Subagents</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="codex" /><category term="claude-code" /><category term="harness-engineering" /><category term="trace" /><summary type="html"><![CDATA[Explains why results alone are not enough and why trace matters for operations, debugging, and evaluation.]]></summary></entry><entry xml:lang="ko"><title type="html">Harness Engineering 08. Results Alone Are Not Enough; Trace Is Needed</title><link href="https://www.k4nul.com/ai/why-trace-matters-more-than-results/" rel="alternate" type="text/html" title="Harness Engineering 08. Results Alone Are Not Enough; Trace Is Needed" /><published>2026-04-17T00:00:00+09:00</published><updated>2026-04-17T00:00:00+09:00</updated><id>https://www.k4nul.com/ai/why-trace-matters-more-than-results</id><content type="html" xml:base="https://www.k4nul.com/ai/why-trace-matters-more-than-results/"><![CDATA[<h2 id="요약">Summary</h2>

<p>The result is there, but no one knows why it turned out that way. In harness engineering, this situation comes up surprisingly often: the code artifact exists and the execution log exists, yet there is no visibility into why a particular team was the owner, why a particular tool was chosen, why the work was handed to a sub-agent, or at which point the handoff failed. So this post looks at why the decision rationale inside a trace differs from a result log, and why improvement only becomes possible once trace exists.</p>

<h2 id="검증-기준과-해석-경계">Verification scope and interpretation boundary</h2>

<ul>
  <li>As of: April 15, 2026, checked OpenAI Codex docs, the official OpenAI API docs, and Anthropic Claude Code docs.</li>
  <li>Source grade: official docs first, plus vendor-authored posts that first introduced a concept as supporting material only.</li>
  <li>Factual scope: only features verifiable in the docs, such as <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, memory/settings, hooks, subagents, approvals, sandboxing, eval, and trace, are stated as fact.</li>
  <li>Interpretive scope: terms like <code class="language-plaintext highlighter-rouge">harness engineering</code>, <code class="language-plaintext highlighter-rouge">control plane</code>, <code class="language-plaintext highlighter-rouge">contract</code>, <code class="language-plaintext highlighter-rouge">enforcement</code>, and <code class="language-plaintext highlighter-rouge">observable harness</code> are operational concepts used in this series; where no evidence line is given, they are my interpretation.</li>
</ul>

<h2 id="결과-로그와-trace-안의-판단-근거의-차이">The Difference Between a Result Log and Decision Rationale in Trace</h2>

<p>A result log usually records what happened: whether tests passed, whether files were modified, which commands ran, and what the final artifact was. That alone covers the basic record keeping that operations needs.
Documented fact: OpenAI's Agents SDK guide describes a "full trace of what happened," and the trace grading guide treats trace as an evaluation input (<a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a>, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>).</p>

<p>The decision-relevant trace discussed here is not an official standard term; it is a descriptive label for the decision context inside a trace, such as owner selection, tool selection, and handoff timing. Why the platform team was chosen as owner instead of the auth team, why a specific analysis tool was used instead of grep, why the work was handed to a sub-agent rather than finished by a single agent, and why the handoff was made at that point all fall into this category. Storing the artifact and tracing this decision rationale may look similar, but their purposes are entirely different.</p>

<h2 id="trace가-없을-때-생기는-운영-문제">Operational Problems When Trace Is Missing</h2>

<p>When trace is weak, failure causes have to be inferred from results alone. Suppose the artifact is there, but an auth-related change was routed to the wrong team. The code shows who ended up owning the work, but not why that call was made. The next improvement then leans on human guesswork instead of refining the rules.
Interpretation: this section describes the operational problem that weak trace keeps root-cause analysis at the level of guessing from results. The documented fact is that trace feeds evaluation with intermediate decisions and tool use, not just output (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>).</p>

<p>Tool choice works the same way. The execution log may record which tool was used, but the reason for choosing it is easy to lose. Without that reason, a poor choice can repeat while the team has little sense of how to fix the harness. Handoff failure is similar: the result may show that information was missing, but only trace can break down which step failed to pass what, and by whom.</p>

<h2 id="어떤-판단-근거를-남겨야-하는가">What Decision Rationale Should Be Preserved?</h2>

<p>This does not mean recording every thought at length. It means preserving the minimum rationale needed for improvement. At a minimum, the reason an owner was selected, the reason a tool was chosen, whether and why work was delegated to a sub-agent, the handoff timing and any omissions, and decisions to narrow or expand scope should be traceable.
Interpretation: fields like the owner-choice reason, the tool-choice reason, and sub-agent delegation are the minimum trace fields I am proposing. That said, OpenAI's trace grading docs officially point toward evaluating the reasoning process and tool path carried in trace (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/codex/subagents">Subagents</a>).</p>
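<p>One lightweight way to enforce these minimum fields is to validate each decision record before it is stored. The field names below are my proposal, not an official schema:</p>

```python
# Minimum reason field this series proposes for each kind of decision
# (hypothetical names, not a vendor schema).
REQUIRED_REASONS = {
    "routing": "owner_reason",
    "tool-choice": "tool_reason",
    "delegation": "subagent_reason",
}

def missing_reasons(trace: list[dict]) -> list[str]:
    """Describe every decision that was stored without its reason."""
    problems = []
    for i, record in enumerate(trace):
        required = REQUIRED_REASONS.get(record.get("step", ""))
        if required and not record.get(required):
            problems.append(f"record {i}: {record['step']} lacks {required}")
    return problems

trace = [
    {"step": "routing", "owner": "payments",
     "owner_reason": "touches payments/ and docs/payments.md"},
    {"step": "tool-choice", "tool": "analyzer"},  # reason omitted
]
print(missing_reasons(trace))  # ['record 1: tool-choice lacks tool_reason']
```

<p>A check like this catches a missing rationale at write time, long before anyone has to reconstruct a failure from the result.</p>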

<p>For example, with trace preserved, a decision like "this change touches both <code class="language-plaintext highlighter-rouge">payments/</code> and <code class="language-plaintext highlighter-rouge">docs/payments.md</code>, so the payments owner should handle it" can be revisited later. And if there is a record that the sub-agent was used not for parallel implementation but because a separate domain context was needed, the legitimacy of the multi-agent use can also be evaluated. Trace of this kind is closer to harness-improvement data than to a debugging memo.</p>

<h2 id="trace와-eval은-어떻게-연결되는가">How Trace and Eval Connect</h2>

<p>Trace is the raw material for eval. When a wrong-owner routing eval fails, trace makes it faster to tell whether the owner-selection rationale was wrong, the ownership mapping was stale, or the routing rule was simply missing. Tool-choice eval works the same way: from results alone it is hard to distinguish a run that succeeded by luck from one that succeeded for good reasons, but with trace that difference can be evaluated.
Documented fact: OpenAI officially documents trace grading as an eval technique (<a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a>, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a>).</p>

<p>Suppose a task succeeded, but the trace shows an unnecessary sub-agent call and two handoff leaks that delayed the documentation update. The product may be fine, but the orchestration was not. Trace is what surfaces exactly this kind of gap outside the result.</p>
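<p>That gap between product result and orchestration quality can be scored separately. This sketch assumes hypothetical event records with <code class="language-plaintext highlighter-rouge">type</code>, <code class="language-plaintext highlighter-rouge">unnecessary</code>, and <code class="language-plaintext highlighter-rouge">leaked</code> fields; none of these names come from the official docs:</p>

```python
def orchestration_issues(trace: list[dict]) -> list[str]:
    """Collect orchestration problems even when the final result passed."""
    issues = []
    for event in trace:
        if event.get("type") == "subagent" and event.get("unnecessary"):
            issues.append("sub-agent call was unnecessary")
        if event.get("type") == "handoff" and event.get("leaked"):
            issues.append(f"handoff leaked: {event.get('what', 'unknown')}")
    return issues

# A run whose result passed, but whose trace shows weak orchestration.
trace = [
    {"type": "subagent", "unnecessary": True},
    {"type": "handoff", "leaked": True, "what": "doc-update context"},
    {"type": "handoff", "leaked": True, "what": "owner mapping"},
    {"type": "result", "passed": True},
]
print(len(orchestration_issues(trace)))  # 3
```

<p>An output-only eval would score this run as a clean pass; the trace-level check is what records that the orchestration needs work.</p>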

<h2 id="내가-직접-느낀-trace의-부족함">The Lack of Trace I Felt Directly</h2>

<p>For a while I, too, thought that execution results and logs were enough. If I knew what ran and what came out at the end, I assumed I could mostly reconstruct the run. But it later became clear that what mattered more was whether the reasoning behind the decisions had been preserved. Without that trail, improving the harness tended to stay stuck at inferring from results.</p>

<p>What this experience showed was not merely that trace is convenient. It was closer to the point that only with trace can failures be fixed structurally.</p>

<h2 id="마무리">Wrap-up</h2>

<p>Storing artifacts and execution logs matters. But that alone cannot fully explain an agent system. Harness engineering includes the observability of the decision process as well as the result, and that is exactly where trace becomes a core asset for debugging, evaluation, and improvement.</p>

<p>The next post carries this thread into approval and guardrails. If trace lets us see the decision process, the next step is that some decisions need to be restricted outright or lifted into an approval flow.</p>

<h2 id="이-글의-한-줄-요약">One-Sentence Summary</h2>

<p>Results and execution logs alone make it hard to improve a harness; only with trace that shows why a decision was made can failure causes be fixed structurally.</p>

<h2 id="다음-글-예고">Preview of the Next Post</h2>

<p>The next post will take up the claim that an agent system is unstable when approval and guardrails are missing. For some decisions, leaving a trace is not enough; certain paths must be blocked in the first place or escalated to an approval step. For risky tool use, edits outside the intended boundary, and work that needs broad permissions, up-front control matters more than post-hoc analysis. So the next post will lay out why approval and guardrails determine harness stability. If trace creates observability, guardrails define the allowed range on top of it.</p>

<h2 id="출처-및-참고">Sources and references</h2>

<ul>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/agents-sdk">Agents SDK</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/trace-grading">Trace grading</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">Evaluation best practices</a></li>
  <li>OpenAI, <a href="https://developers.openai.com/codex/subagents">Subagents</a></li>
</ul>]]></content><author><name>K4nul</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="codex" /><category term="claude-code" /><category term="harness-engineering" /><category term="trace" /><summary type="html"><![CDATA[Explains why results alone are not enough and why trace is needed, from the perspectives of operations, debugging, and evaluation.]]></summary></entry></feed>