This is a short experience report about using skills (with Codex and its models) to build a personal AI assistant that helps me maintain my time-tracking log.
To set expectations: the assistant does not manage my calendar or tasks. It helps me keep a time-tracking log that lives in a Markdown file by interpreting logging requests and editing the file for me (while categorising entries correctly).
I start most days with a bit of planning, which means adding entries to that log. The format is completely custom and tailored to my needs, and I wrote a small companion CLI tool, tt, to generate reports from it. (The project is open source on GitHub, but honestly I don't think it is useful to anyone other than me.)
To give an idea, this is what a day entry looks like:
```
## TT 2026-02-04
- #admin ##work 30m inbox and daily planning
- #prj-content ##work 2h article outline and research notes
- #prj-content ##work 1h 30m first draft writing
- #prj-personal-assistant #llm ##work 1h walking skeleton
- #prj-personal-assistant #llm ##work 1h create skills
- #break ##energy 20m outdoor walk
- #learning ##work 1h documentation reading and summary
```
And using `tt`, I can generate reports like:
```
Overview 2026-02-04 -> 2026-02-04:
- prj-content: 3h 30m
- prj-personal-assistant: 2h 00m
- learning: 1h 00m
- admin: 30m
- break: 20m
Total: 7h 20m

Breakdown:
- ##work: 7h 00m
- ##energy: 20m
```
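To make the format concrete: an entry line can be parsed mechanically into tags, a category, a duration, and a description. This is a hypothetical sketch (not part of the real `tt` tool), assuming the layout shown in the excerpt above:

```python
import re

# Hypothetical parser for one entry line of the custom log format.
# Layout assumed: "- #tag [#tag ...] ##category <duration> <description>"
ENTRY_RE = re.compile(
    r"^- (?P<tags>(?:#\S+ )+)"                  # one or more #tags (first is the project)
    r"(?P<category>##\S+)\s+"                   # ##category, e.g. ##work
    r"(?P<duration>(?:\d+h)?\s*(?:\d+m)?)\s*"   # 2h, 30m, or 1h 30m
    r"(?P<description>.*)$"
)

def parse_entry(line: str) -> dict:
    """Parse a single entry line into a dict, or raise on malformed input."""
    m = ENTRY_RE.match(line)
    if not m:
        raise ValueError(f"not a valid entry line: {line!r}")
    hours = re.search(r"(\d+)h", m.group("duration"))
    mins = re.search(r"(\d+)m", m.group("duration"))
    return {
        "tags": m.group("tags").split(),
        "category": m.group("category"),
        "minutes": (int(hours.group(1)) * 60 if hours else 0)
                   + (int(mins.group(1)) if mins else 0),
        "description": m.group("description"),
    }
```

Normalising durations to minutes is what makes the per-day totals and reports above trivial to compute.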
Editing the file is not hard, but it is tedious. The goal of this project was not to replace my log format, but to make it easier to operate.
Today I ran a small LLM experiment to make logging less cumbersome. Instead of writing an entry like `#prj-personal-assistant #llm #codex ##work 2h Setup walking skeleton`, I want to be able to say: "Create a new task to set up a walking skeleton, add tags codex and llm, and attribute the time to the personal assistant project." And by "say" I mean it literally: I dictate it in normal language, it gets transcribed and sent to the LLM. This turned out to be a surprisingly fast (and fun) experiment with promising first results.
Technical details: I used the Codex agent and its models, mostly Codex 5.2. Working with Codex was smooth, but this post is not about comparing coding agents; I suspect it would work with any capable agent that supports skills.
I started with a log file containing over a year of time entries. That history was a good dataset to prime the LLM on the format: what a day looks like, what an entry line looks like, and how entries should be categorised with tags.
From there I moved into implementation, with a small set of local files and skills.
This is the file tree I ended up with (not ready to call it "architecture" yet):
```
AGENTS.md
skills
├── tt-cli
│   ├── references
│   │   └── command-cheatsheet.md
│   └── SKILL.md
└── tt-log
    ├── references
    │   ├── log-structure.md
    │   ├── tag-inference.md
    │   └── validation.md
    ├── scripts
    │   └── validate_tt_update.py
    └── SKILL.md
time-tracking-log.md
```
`tt` is an abbreviation for "time tracking".
In practice, AGENTS.md tells the agent which skill to use for which capability:
```
### Time tracking
- Use `tt`, the custom time-tracking CLI, for time-tracking operations.
- Use `$tt-log` for `time-tracking-log.md` edits, tag inference, and 7h to 8h daily policy checks.
- Use `$tt-cli` for `tt` command discovery, report commands, and CLI troubleshooting.
- Rule of thumb: log edits/validation => `$tt-log`; reporting/CLI usage questions => `$tt-cli`.
```
The two skills are the heart of the implementation:
- `tt-cli` handles the `tt` CLI tool: command discovery, reporting, filters, and general troubleshooting.
- `tt-log` handles log editing, task insertion, tag inference, section ordering, and policy checks.
From the start I wanted to use skills because my custom format and tooling are a specialised capability. Initially Codex suggested a single skill, but it was clear to me that reading/querying and writing/editing were different responsibilities, so I pushed it in that direction (it agreed, ha!).
That split improved the quality of outcomes. Beyond maintainability, making responsibilities explicit made behaviour more predictable: `tt-log` can focus on reliable edits and validation rules, while `tt-cli` gives the LLM an independent way to check its own work, answering queries like "how much do I still need to log today?" and verifying the resulting log.
The references directories for both skills were set up by the LLM while we created the skills. Their responsibilities are cleanly separated, and reviewing and refining that split proved useful.
During implementation I also wanted basic checks for "did I log enough today?", so we added a validation workflow that checks a daily target range (7h to 8h). The logic is always the same, so I had it write a script: skills/tt-log/scripts/validate_tt_update.py.
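The real script isn't reproduced here, but the core idea is simple: sum the `##work` durations for a day and check them against the 7h to 8h range. A minimal sketch in the spirit of `validate_tt_update.py`, assuming the entry format shown earlier:

```python
import re

# Hypothetical sketch of the daily-total policy check performed by
# skills/tt-log/scripts/validate_tt_update.py (the real script is not shown).
TARGET_MIN, TARGET_MAX = 7 * 60, 8 * 60  # 7h to 8h daily ##work target

def entry_minutes(line: str) -> int:
    """Extract the duration in minutes from a single entry line."""
    h = re.search(r"\b(\d+)h\b", line)
    m = re.search(r"\b(\d+)m\b", line)
    return (int(h.group(1)) * 60 if h else 0) + (int(m.group(1)) if m else 0)

def check_day(entry_lines: list[str]) -> tuple[int, bool]:
    """Sum ##work minutes for one day and check the policy range."""
    total = sum(entry_minutes(line)
                for line in entry_lines
                if line.startswith("- ") and "##work" in line)
    return total, TARGET_MIN <= total <= TARGET_MAX
```

Because the check is a plain script rather than LLM judgement, the agent can run it after every edit and get a deterministic pass/fail answer.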
I iteratively refined the default logging rules (which tags to use for which kinds of tasks, the fact that not all my days look the same, and so on). I don't expect them to be perfect, but I will probably keep tweaking them over the next couple of weeks as exceptions pop up.
So in short:
- Created initial time-tracking skill behaviour for structured log edits based on existing time-tracking data.
- Split responsibilities into two dedicated skills (`tt-log` and `tt-cli`).
- Added automated validation for parse integrity, per-day totals, and a daily policy range.
- Iteratively refined defaults and behaviour based on real usage (for example, a longer workout-at-noon baseline on Tuesdays).
Things I can now ask: "I want to fill the rest of the day with work on a project I forgot the tag of. Give me the last 5 projects I recorded time on so I can tell you what to log to." Before, this was not hard, but it involved a bunch of small chores: checking previous days, finding the right project tag, copying it into a new line, and calculating the time left for the day.
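Mechanically, that query boils down to scanning the log backwards for distinct project tags. A hypothetical helper (assuming project tags start with `#prj-`, as in the excerpt above) might look like:

```python
import re

# Hypothetical helper for "give me the last 5 projects I recorded time on",
# assuming project tags start with #prj- as in the log excerpt above.
def last_projects(entry_lines: list[str], n: int = 5) -> list[str]:
    seen: list[str] = []
    for line in reversed(entry_lines):  # walk backwards: most recent first
        m = re.match(r"- (#prj-\S+)", line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
        if len(seen) == n:
            break
    return seen
```

The point is not that this code is hard to write, but that the agent now does this kind of lookup (plus the remaining-time arithmetic) for me on request.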
Besides the usefulness (and fun), there was an unexpectedly valuable lesson: AI assistance works best the same way good code does. Define clear boundaries and add executable checks, so changes are easier to make and the system can validate its own work.