Can a Free Model Carry the Grunt Work? GLM-5.2 vs GPT-5.5 Codex, three delegation tasks
Delegation Benchmark · 2026-07-04 · English edition
Video version (4 min): https://youtu.be/sRqLns96NQM
Context: our main coding agent runs on a limited subscription, so we delegate typing-heavy and file-reading work to models on NVIDIA NIM's free tier — the expensive model only writes specs and does acceptance. When z-ai shipped GLM-5.2 we re-ran our delegation eval, with OpenAI Codex CLI (gpt-5.5) as the paid baseline. Every prompt and acceptance test is included below — rerun it yourself.
00 One-sentence verdict
Free GLM-5.2 now matches GPT-5.5 Codex on functional quality (both aced all 36 acceptance checks); every point of difference lives in "engineering character": GLM fabricates timestamps and has wild latency variance (33 s to 23 min), while Codex is steady and honest about metadata — but got caught shipping code it never ran. Under a token-saving mandate, GLM-5.2 plus mechanized acceptance (lint scripts + diff checks) is good enough for day-to-day delegation; keep Codex for urgent work and audit-grade root-cause analysis.
01 Method
| Item | GLM-5.2 | GPT-5.5 Codex |
|---|---|---|
| Access path | opencode CLI → NVIDIA NIM free tier (z-ai/glm-5.2) | OpenAI Codex CLI 0.139.0 (gpt-5.5, reasoning medium) |
| Execution | headless, identical prompts, separate clean working directories, local shell tools available | (same for both) |
| Authoring & grading | Designed by Claude (Fable 5): ground truth existed before any run; contestants never saw the fixtures or the rubric | (same for both) |
| Grading environment | PHP tasks graded on two stacks: local Windows PHP 7.3.30 and Debian 13 / PHP 8.4.21 (production mirror); 8.4 is authoritative | (same for both) |
The three classes map to real delegation scenarios: A / Generate (new program from a spec), B / Read (large-file analysis and summary), C / Modify (edit existing code while honoring team conventions). Every task embeds traps where transcribing the spec verbatim produces a wrong answer — you have to actually understand it.
02 Wall-clock time (seconds, lower is better)
[Chart: same-prompt wall-clock time, shared scale. GLM-5.2's class-A time includes a ~20-minute environment-debugging rabbit hole (see §05).]
03 Class C: modification (Codex 88, GLM 82)
Task: refactor a donation-fee function calcFee($amount) into a three-parameter version. The spec plants four traps: the rate tier is chosen by the original amount but the fee is computed on the net amount; exact boundary values (1000 / 10000); the member discount applies before rounding; the minimum-fee rule applies after it, waived when net is zero. Team conventions are mandatory: wrap every change in comment markers carrying a real timestamp and the model's own name, and never delete replaced code — comment it out in place.
Acceptance: 17 boundary tests + php -l + git diff scope check + convention-by-convention format review.
| Checkpoint | GLM-5.2 | Codex |
|---|---|---|
| 17 boundary tests (incl. four traps) | 17/17 | 17/17 |
| Touched only the target file | ✓ | ✓ |
| Old code preserved as comments | ✓ | ✓ (also spotted that nesting block comments would break, switched to line comments on its own) |
| Honest model attribution | "GLM-5.2", correct | Wrote "GPT-5", actually gpt-5.5 (minor drift) |
| Honest timestamps | Fabricated: wrote 14:30, actual time 19:33 | Ran Get-Date first, then wrote the timestamp accurate to the second |
| Marker-line format | Dropped the attribution clause on 4 end-markers | Fully compliant |
| Wall-clock | 91s | 311s |
04 Class B: reading (GLM 97, Codex 95)
Task: a 3,000-line synthetic log (format: timestamp LEVEL [module] message), five questions with unique correct answers: ① total ERROR-level lines (with 10 decoy lines whose messages contain "ERROR" but whose level is not); ② per-module ERROR distribution; ③ line numbers of the first and last ERROR (the last one sits on the literal final line — a trap for models that silently under-read file tails); ④ root-cause inference for a two-minute burst window (connection-pool exhaustion cascading into login and payment failures); ⑤ FATAL count and line numbers.
| Checkpoint | GLM-5.2 | Codex |
|---|---|---|
| Five-question accuracy | 5/5 | 5/5 |
| Decoys (10 fake-ERROR lines) | All rejected | All rejected |
| File-tail trap (line 3000) | ✓ | ✓ |
| Strategy | grep + uniq to focus, never brute-read the file; hit a path quirk, self-recovered in 5 s | Structured extraction, every causal link carries line numbers, zero speculation |
| Root-cause inference | Correct, but volunteered one unverified causal extension (−2) | Correct and fully auditable |
| Wall-clock | 33s | 121s |
05 Class A: generation (Codex 95, GLM 93)
Task: write a donation-CSV summary CLI script from scratch. Six traps in the spec: quoted fields containing commas ("李,大華"), blank lines skipped without consuming an error line number, line numbers counted from the header row, sort ties broken by string comparison, errors to STDERR with the report on STDOUT, and three exit codes (0/1/2). Contestants got the spec only — never the fixtures.
Acceptance: three fixtures (a messy file with BOM + four bad rows / a clean file / no argument) compared item-by-item against expected output on PHP 8.4, the production mirror.
| Checkpoint | GLM-5.2 | Codex |
|---|---|---|
| Full acceptance on PHP 8.4 | Perfect | Perfect |
| Self-verification | Five functional self-tests incl. exit codes | Ran php -l only — no functional test |
| Portability (PHP 7.3 + Big5 locale) | Hand-rolled parser, still perfect | str_getcsv version fails across the board |
| Environment-issue handling | Hit the locale landmine during self-tests → diagnosed Big5 as the root cause → rewrote for portability → cleaned up its debug file | Completely unaware (because it never self-tested) |
| Wall-clock | 1372s (23 min, mostly in the debugging rabbit hole) | 84s |
On PHP 7.3 under a Big5 locale, str_getcsv/fgetcsv mis-parse CSV fields containing CJK text: parsed fields shift around, and because the corruption is content-dependent rather than total, it is very easy to misdiagnose as a program bug. Codex's spec-correct script failed every local check and passed every check on PHP 8.4, so it was nearly convicted on bad evidence. If you process CJK CSV with an old PHP on a Chinese-locale Windows box, grade your code on a machine matching production.
06 Scoreboard and reading
| Class | GLM-5.2 | Codex | GLM time | Codex time | Decider |
|---|---|---|---|---|---|
| A Generate | 93 | 95 | 1372s | 84s | GLM had the more complete engineering loop; lost on runaway time |
| B Read | 97 | 95 | 33s | 121s | Both perfect; GLM 3.7× faster and free |
| C Modify | 82 | 88 | 91s | 311s | Functionally tied; GLM lost on metadata honesty (fabricated timestamps) |
- The quality gap is gone. All 36 acceptance checks passed on both sides, including every deliberately planted trap. The "free models aren't usable" prior deserves an update.
- Every lost point is an engineering-character point — and character can be patched with process. GLM's fabricated timestamps and dropped clauses are mechanically checkable and mechanically fixable; we now pre-feed timestamps into the delegation prompt and lint the markers with a script. Post-fix, its effective score closes on Codex.
- Strength is a function of task shape, not a fixed label. In class C Codex showed honesty and self-correction; in class A it skipped self-testing entirely, while GLM ran a full diagnose–rewrite–self-test–cleanup loop. "Which model is better?" is the wrong question — you want a routing table, not a favorite.
- GLM's biggest practical weakness is latency variance: anywhere from 33 s to 23 min (free-tier queuing plus a taste for rabbit holes). Fine for fire-and-forget background delegation; wrong tool for anything urgent.
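The marker lint mentioned in the list above can be sketched as follows. This is a hypothetical Python sketch, not the script used in the eval: the function name `lint_markers`, the regular expressions, and the 10-minute drift tolerance are all assumptions. Only the marker layout itself is taken from the class-C prompt.

```python
import re
from datetime import datetime, timedelta

# Marker layout follows the class-C prompt; the regex itself is an assumption.
MARKER = re.compile(
    r"^\s*//\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<desc>.+?)\. By (?P<model>.+?) \(effort: default\), 傳企監看。"
    r"(?P<kind>begin| end)$"
)
# Any line comment that opens with a date is treated as an intended marker.
CANDIDATE = re.compile(r"^\s*//\s*\d{4}-\d{2}-\d{2}\b")


def lint_markers(source, run_time, expected_model, tolerance=timedelta(minutes=10)):
    """Return a list of (line_number, problem) tuples; an empty list means compliant."""
    problems = []
    open_blocks = 0
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not CANDIDATE.match(line):
            continue
        m = MARKER.match(line.rstrip())
        if m is None:
            # Catches dropped attribution clauses and wrong begin/end spacing.
            problems.append((lineno, "malformed marker"))
            continue
        try:
            stamp = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            problems.append((lineno, "impossible timestamp"))
            continue
        if abs(stamp - run_time) > tolerance:
            problems.append((lineno, "timestamp drift"))
        if m["model"] != expected_model:
            problems.append((lineno, "model attribution mismatch"))
        if m["kind"] == "begin":
            open_blocks += 1
        elif open_blocks == 0:
            problems.append((lineno, "end without begin"))
        else:
            open_blocks -= 1
    if open_blocks:
        problems.append((0, "unclosed begin marker"))
    return problems
```

Passing the wall-clock time recorded by the harness as `run_time` would flag a marker stamped 14:30 on a run that actually happened at 19:33, and an end-marker missing its "By ..." clause shows up as a malformed marker.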
07 Full prompts (rerun it yourself)
Below are the three delegation prompts exactly as sent — in Traditional Chinese, because that's what actually ran; an English gloss precedes each. Any agent CLI (opencode, Codex, Claude Code, aider…) can run them: make a clean directory, feed the prompt, grade with the included tests. Two reproduction gotchas: opencode run ignores your shell's working directory — pass --dir; and close stdin when running headless (prefix '' | in PowerShell or < /dev/null), or the CLI hangs forever.
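A minimal Python sketch of a headless runner that applies the closed-stdin gotcha and records wall-clock time. The helper name `run_headless` and the 30-minute timeout are assumptions, and the command list is left to the caller because the exact arguments differ per agent CLI.

```python
import subprocess
import sys
import time


def run_headless(cmd, workdir, timeout_s=1800):
    """Run one CLI invocation with stdin closed; return (exit_code, seconds, output)."""
    start = time.monotonic()
    proc = subprocess.run(
        cmd,
        cwd=workdir,               # a clean directory per contestant
        stdin=subprocess.DEVNULL,  # closed stdin, so the CLI cannot wait for input
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr into the captured output
        text=True,
        timeout=timeout_s,         # raises subprocess.TimeoutExpired on a runaway run
    )
    return proc.returncode, time.monotonic() - start, proc.stdout


if __name__ == "__main__":
    # Stand-in command; substitute the agent CLI invocation you actually use.
    code, seconds, output = run_headless([sys.executable, "-c", "print('ok')"], ".")
    print(code, round(seconds, 1), output.strip())
```

Setting `cwd` only controls the child's working directory. A CLI that ignores it, as described above for opencode run, still needs its own directory flag in `cmd`.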
Class C task: calcFee refactor spec (with the starting file)
English gloss: refactor calcFee($amount) to calcFee($amount, $coupon = 0, $isMember = false). Validate amount/coupon as non-negative integers with coupon ≤ amount. Net = amount − coupon. Rate tier by the ORIGINAL amount (<1000 → 2%, 1000–9999 → 1.5%, ≥10000 → 1%); fee computed on NET; member discount ×0.8 before rounding; round half-up; minimum-fee rule last (net = 0 → fee 0, else fee < 5 → 5). Wrap all edits in timestamped, model-signed begin/end comment markers; keep replaced code commented out in place; add a changelog line; extend the CLI entry to accept coupon and isMember.
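The ordering in the gloss can be sketched as a small reference model. This is a Python illustration of the rule, not the graded PHP; the name `calc_fee` and the use of `ValueError` in place of PHP's InvalidArgumentException are assumptions. It uses `Decimal` with `ROUND_HALF_UP` because Python's built-in `round()` rounds halves to even, which would turn 13.5 into 14 but 12.5 into 12.

```python
from decimal import Decimal, ROUND_HALF_UP


def calc_fee(amount, coupon=0, is_member=False):
    """Reference model of the class-C fee rule (illustrative sketch only)."""
    for value in (amount, coupon):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("amount and coupon must be non-negative integers")
    if coupon > amount:
        raise ValueError("coupon must not exceed amount")

    net = amount - coupon
    if net == 0:
        return 0  # same result as applying the minimum-fee waiver last

    # The tier is chosen by the ORIGINAL amount, not by net.
    if amount < 1000:
        rate = Decimal("0.02")
    elif amount < 10000:
        rate = Decimal("0.015")
    else:
        rate = Decimal("0.01")

    fee = Decimal(net) * rate          # the fee itself is computed on net
    if is_member:
        fee *= Decimal("0.8")          # discount applies before rounding
    rounded = int(fee.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    return max(rounded, 5)             # minimum fee applies after rounding
```

For example, `calc_fee(1200, 300)` picks the 1.5% tier from the original 1200, computes 13.5 on the net 900, and rounds up to 14. A tier-by-net implementation would use 2% and return 18.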
Starting file feeCalc.php:
<?php
/*
目的:計算捐款金流手續費(線上刷卡通道)。
作者:徐傳企 Mario Hsu
沿革:
2026-05-02 v0.0.0.2 1.手續費率由 1.8% 調整為 2%。
2026-04-11 v0.0.0.1 1.誕生日。
*/

/**
 * 計算單筆捐款的金流手續費。
 *
 * @param int $amount 捐款金額(新台幣,整數元)
 * @return int 手續費(元)
 */
function calcFee($amount)
{
    if (!is_int($amount)) {
        throw new InvalidArgumentException("amount must be an integer");
    }
    if ($amount < 0) {
        throw new InvalidArgumentException("amount must not be negative");
    }
    $fee = $amount * 0.02;
    return (int) round($fee);
}

// CLI 測試入口:php feeCalc.php <amount>
if (PHP_SAPI === 'cli' && isset($argv) && basename(__FILE__) === basename($argv[0])) {
    $amount = isset($argv[1]) ? (int) $argv[1] : 0;
    echo calcFee($amount) . PHP_EOL;
}
Delegation prompt (as sent):
請修改本目錄下的 feeCalc.php,把 calcFee() 依以下規格改版。只能改這一個檔案。
## 新函式簽名
function calcFee($amount, $coupon = 0, $isMember = false)
## 計算規格(嚴格依序執行)
1. 參數驗證:$amount 與 $coupon 都必須是整數且 >= 0,且 $coupon 不得大於 $amount,
違反任一條件即 throw InvalidArgumentException。
2. 淨額 net = $amount - $coupon。
3. 費率級距**必須以原始的 $amount 判斷**(不是以 net 判斷):
- $amount < 1000 → 費率 2.0%
- 1000 <= $amount <= 9999 → 費率 1.5%(注意:剛好 1000 適用 1.5%)
- $amount >= 10000 → 費率 1.0%(注意:剛好 10000 適用 1.0%)
4. 費用以**淨額 net** 計算:fee = net * 費率。
5. 若 $isMember 為 true,費用打 8 折(是費用打折,不是金額打折):fee = fee * 0.8。
6. 四捨五入到整數元(round half up)。
7. 最低手續費規則**最後才套用**:若 net == 0,手續費為 0(不套最低費);
若 net > 0 且算出的手續費 < 5,手續費為 5。
## 程式慣例(全域 SOP,必須遵守)
- 所有修改的區塊用一對註解包起來,格式嚴格如下(<模型名>寫你自己實際的模型名稱,
時間戳寫實際當下時間精確到秒):
// YYYY-MM-DD HH:MM:SS <一句說明>. By <模型名> (effort: default), 傳企監看。begin
...修改的程式碼...
// YYYY-MM-DD HH:MM:SS <一句說明>. By <模型名> (effort: default), 傳企監看。 end
(注意:begin 行結尾緊接 begin 無空格;end 行是「。 end」句號後一個空格再 end)
- 被取代的舊程式碼**不可刪除**,整塊註解掉保留在原處,舊區塊與新區塊各自用一對
begin/end 包起來。
- 檔頭「沿革」加一行新紀錄(版號 v0.0.0.3),說明本次改動。
- CLI 測試入口也要更新成可傳入 coupon 與 isMember:php feeCalc.php <amount> [coupon] [isMember(0/1)]。
完成後只回報:改了哪些地方、每處一句話說明,不要貼程式碼全文。
Class C acceptance: the 17 boundary tests (runTests.php)
<?php
// Acceptance: php runTests.php <path-to-feeCalc.php>
require $argv[1];

// Each case: [[amount, coupon, isMember], expected fee, label]
$cases = [
    [[0, 0, false], 0, 'net=0 special case (no minimum fee)'],
    [[500, 0, false], 10, 'plain 2% tier'],
    [[999, 0, false], 20, 'boundary 999 → 2% (19.98 rounds to 20)'],
    [[1000, 0, false], 15, 'boundary: exactly 1000 → 1.5%'],
    [[9999, 0, false], 150,'boundary 9999 → 1.5% (149.985 → 150)'],
    [[10000, 0, false], 100,'boundary: exactly 10000 → 1.0%'],
    [[1200, 300, false], 14,'trap: tier by original 1200 (1.5%), fee on net 900 → 13.5 → 14'],
    [[10000, 5000, false], 50,'trap: tier by original 10000 (1%), net 5000 → 50 (tier-by-net gives 75)'],
    [[1000, 0, true], 12, 'member 20% off: 15 → 12'],
    [[300, 0, true], 5, 'member 4.8 → round 5 → minimum 5'],
    [[200, 0, true], 5, 'trap: 3.2 → 3 → minimum 5 (order: minimum applies after rounding)'],
    [[100, 0, false], 5, 'minimum fee 5'],
    [[1000, 1000, false], 0,'trap: net=0 → 0, minimum waived'],
    [[1000, 1000, true], 0, 'net=0 + member → still 0'],
];

// Each case: [[amount, coupon, isMember], label]; all must throw InvalidArgumentException
$exceptions = [
    [[-1, 0, false], 'negative amount must throw'],
    [[100, 200, false], 'coupon > amount must throw'],
    [[100, -5, false], 'negative coupon must throw'],
];

$pass = 0; $fail = 0;

foreach ($cases as $i => [$args, $exp, $label]) {
    try {
        $got = calcFee(...$args);
        if ($got === $exp) { $pass++; echo "PASS #".($i+1)." $label\n"; }
        else { $fail++; echo "FAIL #".($i+1)." $label — expected $exp got ".var_export($got, true)."\n"; }
    } catch (Throwable $e) {
        $fail++; echo "FAIL #".($i+1)." $label — unexpected exception: ".$e->getMessage()."\n";
    }
}

foreach ($exceptions as $i => [$args, $label]) {
    $n = count($cases) + $i + 1;
    try {
        calcFee(...$args);
        $fail++; echo "FAIL #$n $label — no exception thrown\n";
    } catch (InvalidArgumentException $e) {
        $pass++; echo "PASS #$n $label\n";
    } catch (Throwable $e) {
        $fail++; echo "FAIL #$n $label — wrong exception type: ".get_class($e)."\n";
    }
}

echo "RESULT: $pass/".($pass+$fail)." passed\n";
Class B task: five log questions + the log generator
English gloss: analyze app.log (~3,000 lines, format: timestamp LEVEL [module] message). Answer five questions — ① count of ERROR-level lines, explicitly excluding lines whose message merely contains "ERROR"; ② ERROR distribution by module; ③ line numbers of the first and last ERROR; ④ what happened between 02:14 and 02:16 — infer root cause and downstream impact in one or two sentences; ⑤ FATAL count and line numbers. Reply with conclusions and line numbers only, 15 lines max.
Delegation prompt (as sent — drop the log into the working directory first):
分析本目錄下的 app.log(約 3000 行,格式:時間戳 LEVEL [module] 訊息)。
只回結論與行號,不要引整段原文內容。
回答以下五題:
1. level 為 ERROR 的總行數。注意:訊息文字裡含有 ERROR/error 字樣、
但 level 欄位不是 ERROR 的行,不算。
2. ERROR 依 module(方括號內)的分佈統計,每個 module 各幾行。
3. 第一個 ERROR 與最後一個 ERROR 各在第幾行(行號)。
4. 02:14 至 02:16 之間發生了什麼事?用一到兩句話推斷根本原因,
以及它造成的後續影響。
5. level 為 FATAL 的有幾行?各在第幾行?
輸出格式:五題各一行結論,總長度不超過 15 行。
Log generator (php gen_log.php app.log; prints the ground truth as it writes):
<?php
// Deterministically generates a 3,000-line test log with known facts and traps.
$out = fopen($argv[1], 'w');

// ERROR line positions (level=ERROR)
$dbBurst  = range(421, 463, 3);                                  // 15-line pool-exhausted burst
$dbOther  = [87, 350, 890, 1420, 1980, 2300, 2650, 2900];        // 8 lines
$auth     = [470, 475, 480, 485, 490, 495, 500, 1100, 2450];     // 9 lines (cascade after burst)
$payment  = [505, 510, 515, 520, 525, 1700, 2100, 2999];         // 8 lines
$cron     = [600, 1300, 1900, 2500];                             // 4 lines
$api      = [750, 1600, 3000];                                   // 3 lines (3000 = file-tail trap)
$fatal    = [1204, 2718];                                        // 2 FATAL lines
$infoTrap = [200, 800, 1500, 2000, 2600, 2950];                  // INFO whose message contains ERROR
$warnTrap = [300, 1000, 1800, 2400];                             // WARN containing lowercase error

// Map line number => "LEVEL [module] message" for every non-filler line
$special = [];
foreach ($dbBurst as $l)  $special[$l] = "ERROR [db] connection pool exhausted (pool=main, waited 30s, active=50/50)";
foreach ($dbOther as $l)  $special[$l] = "ERROR [db] query timeout on donations table (took 31024ms)";
foreach ($auth as $l)     $special[$l] = "ERROR [auth] login failed: timeout waiting for db connection from pool";
foreach ($payment as $l)  $special[$l] = "ERROR [payment] charge failed: upstream db timeout, order rolled back";
foreach ($cron as $l)     $special[$l] = "ERROR [cron] scheduledTelegramNotice aborted: lock file stale";
foreach ($api as $l)      $special[$l] = "ERROR [api] 500 on POST /donate: unhandled exception";
foreach ($fatal as $l)    $special[$l] = "FATAL [kernel] out of memory killer invoked, php-fpm worker slain";
foreach ($infoTrap as $l) $special[$l] = "INFO [monitor] ERROR rate within threshold, dashboard /admin/ERROR_REPORT rendered";
foreach ($warnTrap as $l) $special[$l] = "WARN [api] client reported error page screenshot uploaded to s3";

$fillers = [
    "INFO [api] GET /lastnews.php 200 in %dms",
    "INFO [auth] session refreshed for uid=%d",
    "INFO [db] slow query log rotated, %d entries",
    "INFO [payment] webhook heartbeat ok seq=%d",
    "INFO [cron] tick, next job in %ds",
];

// One line every 2 seconds, starting just after 02:00:00
$base = strtotime('2026-07-04 02:00:00');
for ($i = 1; $i <= 3000; $i++) {
    $ts = date('Y-m-d H:i:s', $base + 2 * $i);
    $msg = $special[$i] ?? sprintf($fillers[$i % 5], ($i * 37) % 900 + 10);
    fwrite($out, "$ts $msg\n");
}
fclose($out);

// ground truth
$all = array_merge($dbBurst, $dbOther, $auth, $payment, $cron, $api);
sort($all);
echo "ERROR total: " . count($all) . "\n";
echo "db: " . (count($dbBurst) + count($dbOther)) . " auth: " . count($auth) .
    " payment: " . count($payment) . " cron: " . count($cron) . " api: " . count($api) . "\n";
echo "first ERROR line: {$all[0]}, last ERROR line: " . end($all) . "\n";
echo "FATAL: " . count($fatal) . " lines at " . implode(',', $fatal) . "\n";
echo "burst window lines 421-463, ts " . date('H:i:s', $base + 2*421) . "-" . date('H:i:s', $base + 2*463) . "\n";
Ground truth: 47 ERRORs (db 23 / auth 9 / payment 8 / cron 4 / api 3); first and last at lines 87 and 3000; 2 FATALs (1204, 2718); root cause: db connection-pool exhaustion starting 02:14 (active 50/50), cascading into auth login timeouts and payment rollbacks.
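The countable parts of this ground truth (questions ①, ②, ③ and ⑤) can be cross-checked independently of the generator by parsing the level field instead of searching for the substring "ERROR". The sketch below is a hypothetical Python checker. The function name `summarize_log` is an assumption, and so is the reliance on single spaces separating date, time, level and module, which matches what the generator writes.

```python
from collections import Counter


def summarize_log(lines):
    """Count ERROR/FATAL lines by the level FIELD, never by substring."""
    by_module = Counter()
    error_lines, fatal_lines = [], []
    for lineno, line in enumerate(lines, start=1):
        # Assumed layout: "<date> <time> <LEVEL> [<module>] <message>"
        parts = line.rstrip("\r\n").split(" ", 4)
        if len(parts) < 4:
            continue  # not a well-formed log line
        level, module = parts[2], parts[3].strip("[]")
        if level == "ERROR":
            error_lines.append(lineno)
            by_module[module] += 1
        elif level == "FATAL":
            fatal_lines.append(lineno)
    return {
        "error_total": len(error_lines),
        "by_module": dict(by_module),
        "first_error": error_lines[0] if error_lines else None,
        "last_error": error_lines[-1] if error_lines else None,
        "fatal_lines": fatal_lines,
    }
```

Passing an open file object, for example `summarize_log(open("app.log", encoding="utf-8"))`, iterates to the true end of the file, so the ERROR on the final line is not lost. The ten decoy lines are rejected because their third field is INFO or WARN.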
Class A task: donation-CSV report script spec + fixtures
English gloss: create donateReport.php (PHP CLI, touch no other file). Read the CSV named by argv[1] (name, amount, date, method; header row skipped; quoted fields may contain commas; trim every field; CRLF possible; blank lines skipped without an error). Validate amount (^[0-9]+$ and > 0) and date (^\d{4}-\d{2}-\d{2}$); violations go to STDERR as「第 N 行: …」with real file line numbers, and the row is excluded; if both fail, report only the amount error. STDOUT: one line per donor (name TAB total TAB count), sorted by total desc then name asc via strcmp; then "---"; then a totals line. Exit 1 on any validation error, 0 when clean, 2 when the file argument is missing or nonexistent. Include a documented file header and run php -l when done.
Delegation prompt (as sent):
在本目錄新建 donateReport.php(PHP CLI 腳本,不可動其他檔案),依以下規格實作。
## 功能
讀取 argv[1] 指定的捐款 CSV 檔,輸出彙總報表。
## 輸入格式
- CSV 欄位:姓名,金額,日期,方式。第 1 行是標題列,一律跳過。
- 欄位可能被雙引號包住且內含逗號(如 "李,大華"),必須正確解析(建議 str_getcsv)。
- 每個欄位取值前先 trim 前後空白;行尾可能是 CRLF。
- 整行 trim 後為空字串的空白行:直接跳過,不算錯誤、不報行號。
## 驗證規則(行號 = 實際檔案行號,含標題列在內從 1 起算)
- 金額:必須是純數字正整數(regex ^[0-9]+$ 且 > 0)。違反 → 往 STDERR 印一行
「第 N 行: 金額不合法」,該行不列入彙總。
- 日期:必須符合 ^\d{4}-\d{2}-\d{2}$ 格式(只驗格式)。違反 → STDERR
「第 N 行: 日期不合法」,不列入彙總。
- 同一行金額與日期都錯時,只報金額錯誤一行即可。
## 輸出(STDOUT)
- 每位捐款人一行:姓名<TAB>總金額<TAB>筆數
- 排序:總金額由大到小;金額相同時依姓名字串升冪(strcmp)。
- 之後一行「---」
- 最後一行:總計 N 筆 M 元(N=有效筆數,M=有效金額加總)
## 結束碼
- 有任何驗證錯誤 → exit 1;全部有效 → exit 0。
- argv[1] 未給或檔案不存在 → STDERR「檔案不存在」,exit 2。
## 程式慣例
- 檔頭需有 block 註解:目的/作者(徐傳企 Mario Hsu,AI 協助註明實際模型名)/
沿革(YYYY-MM-DD v0.0.0.1 1.誕生日。)。
- 寫完自己跑 php -l 驗證語法。
完成後只回報:檔案已建立、實作要點三句話以內,不要貼程式碼全文。
Grading fixture messy.csv (UTF-8 with BOM, CRLF line endings):
姓名,金額,日期,方式
王小明,500,2026-07-01,線上
"李,大華",1200,2026-07-02,轉帳
(blank line here)
王小明,700,2026-07-03,線上
陳阿姨,abc,2026-07-03,現金
林先生,0,2026-07-04,線上
張三,-100,2026-07-04,線上
趙四,300,2026/07/05,現金
李四,1200,2026-07-05,轉帳
王小明, 800 ,2026-07-05,線上
Expected: STDOUT lists 王小明 2000/3, then 李,大華 1200/1, then 李四 1200/1 (strcmp tie-break), totals 5 rows / 4400; STDERR carries four errors (amounts on lines 6/7/8, date on line 9); exit 1. A clean fixture checks exit 0; a missing argument checks exit 2.
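The expected output above can be reproduced with a compact reference model of the spec. This is a Python sketch for checking expectations, not either contestant's donateReport.php. The function name `summarize_donations` and the exact spacing inside the 「第 N 行」 and 「總計 N 筆 M 元」 lines are assumptions, and the exit-2 path for a missing file is left out because the function takes the file content as a string.

```python
import csv
import re


def summarize_donations(text):
    """Return (stdout_lines, stderr_lines, exit_code) for one CSV document."""
    out, err = [], []
    totals = {}  # donor name -> [sum of amounts, number of valid rows]
    text = text.lstrip("\ufeff")  # drop a UTF-8 BOM if present
    for lineno, raw in enumerate(text.split("\n"), start=1):
        line = raw.rstrip("\r")  # tolerate CRLF line endings
        if lineno == 1 or line.strip() == "":
            continue  # header and blank lines; the line number still advances
        fields = [f.strip() for f in next(csv.reader([line]))]
        name, amount, date = (fields + ["", "", ""])[:3]
        if not re.fullmatch(r"[0-9]+", amount) or int(amount) <= 0:
            err.append(f"第 {lineno} 行: 金額不合法")
            continue  # the amount error wins when both fields are bad
        if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", date):
            err.append(f"第 {lineno} 行: 日期不合法")
            continue
        entry = totals.setdefault(name, [0, 0])
        entry[0] += int(amount)
        entry[1] += 1
    # Total descending, then name ascending by byte order (what strcmp compares).
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1][0], kv[0].encode("utf-8")))
    for name, (total, count) in ranked:
        out.append(f"{name}\t{total}\t{count}")
    out.append("---")
    rows = sum(count for _, count in totals.values())
    money = sum(total for total, _ in totals.values())
    out.append(f"總計 {rows} 筆 {money} 元")
    return out, err, (1 if err else 0)
```

Parsing one physical line at a time keeps the reported numbers equal to real file line numbers, at the cost of not supporting quoted fields that span lines, which the spec does not require.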
08 Limitations
- n = 1 per class. This is a practical smoke test for a delegation routing table, not an academic benchmark; scores are single-run, and NIM free-tier latency is highly variable — expect different wall-clock numbers on a rerun.
- Codex ran at reasoning medium (the CLI's recommended default); GLM-5.2 ran on NIM's free tier, which may behave differently from the paid/official API.
- The "engineering character" items (attribution, timestamps, self-testing) carry subjective rubric weight — the rubric is stated inline; re-weight it as you see fit.
- Reproduction gotchas, once more: pass --dir to opencode run; close stdin ('' |) for any headless agent CLI; grade CJK-CSV work on a machine matching production PHP.
Test design, ground truth, grading, and this page: Claude (Fable 5). Contestants: z-ai GLM-5.2 (NVIDIA NIM free tier, via opencode CLI) and OpenAI gpt-5.5 (Codex CLI 0.139.0). 2026-07-04, Taiwan. Original report in Traditional Chinese: gvip88.blogspot.com. Prompts and grading code are free to reuse.