April 16, 2026
How I Automated Meeting Action Items with Zoom and Gemini AI
Turning a 2-day delay into a 15-minute pipeline
A few weeks ago, a CEO I work with told me something frustrating: "We have 30 client meetings a week, and nobody knows what was decided until two days later."
The problem wasn't that meetings weren't happening. It was that action items lived in people's heads until someone remembered to write them down. By then, details were fuzzy, deadlines were missed, and clients noticed.
The existing setup? An N8N workflow that dumped raw Zoom transcripts into growing text files on Google Drive. Nobody read them. Thousands of words per meeting, unstructured, unsearchable.
I built a replacement in a day. Here's how.
The Architecture
The system is simple in concept:
- A cron job runs every 5 minutes
- It polls the Zoom API for new recordings with transcripts
- It matches each meeting to a client using keyword rules
- It downloads the VTT transcript and parses it into speaker-labeled text
- Gemini AI extracts action items, decisions, and a summary
- A formatted message goes to the client's Slack channel
- Everything gets stored in SQLite for dedup and history
Zoom Call Ends (~5 min processing)
|
v
[Cron every 5 min] poll.js
|-- Zoom API: list new recordings
|-- Match topic -> client (keyword rules)
|-- Download VTT transcript
|-- Parse VTT -> speaker-labeled text
|-- Gemini: extract structured data (JSON)
|-- Post to Slack
'-- Store in SQLite (dedup)

The whole pipeline runs autonomously. No human intervention needed.
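The steps above can be sketched as a single poll function. The step names here (fetchNewRecordings, matchClient, and so on) are hypothetical stand-ins for the stages in the diagram, injected as dependencies so the flow itself is easy to test:

```javascript
// Sketch of the poll.js entry point. Each stage is a hypothetical
// injected function; the real implementations are described in the
// sections below.
async function pollOnce(steps) {
  const results = [];
  const meetings = await steps.fetchNewRecordings();
  for (const meeting of meetings) {
    // Dedup: skip anything already handled on a previous run
    if (await steps.alreadyProcessed(meeting.uuid)) continue;
    const client = steps.matchClient(meeting.topic);
    const transcript = await steps.downloadTranscript(meeting);
    const extracted = await steps.extract(transcript);
    await steps.postToSlack(client, extracted);
    await steps.markProcessed(meeting.uuid);
    results.push({ uuid: meeting.uuid, client });
  }
  return results;
}
```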
Zoom Server-to-Server OAuth
The first challenge was authentication. Zoom's API uses Server-to-Server OAuth for automated access — no user login required, which is exactly what a cron job needs.
The tricky part: S2S apps can't use users/me to list recordings. You need to query each user individually. The account had 4 users, so the recording discovery function loops through all of them:
export async function listRecordings(lookbackHours = 24) {
  const token = await getAccessToken();
  const users = await listUsers();
  // Zoom expects yyyy-mm-dd values for the from/to recording range
  const to = new Date().toISOString().slice(0, 10);
  const from = new Date(Date.now() - lookbackHours * 3600 * 1000)
    .toISOString().slice(0, 10);
  const allMeetings = [];
  for (const user of users) {
    const url = `https://api.zoom.us/v2/users/${user.id}/recordings` +
      `?from=${from}&to=${to}&page_size=100`;
    const res = await fetch(url, {
      headers: { 'Authorization': `Bearer ${token}` },
    });
    const data = await res.json();
    allMeetings.push(...(data.meetings || []));
  }
  return allMeetings;
}

Token caching matters here. The access token lasts an hour, and we're polling every 5 minutes, so there's no need to re-authenticate on every run.
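The caching itself can be as small as a closure that holds the token until shortly before it expires. This is a sketch; `fetchToken` is a hypothetical stand-in for the actual Zoom S2S OAuth request, and the clock is injectable so the logic is testable:

```javascript
// Token cache sketch: reuse the access token until shortly before
// expiry instead of re-authenticating on every 5-minute poll.
// `fetchToken` is a hypothetical stand-in for the Zoom OAuth call.
function makeTokenCache(fetchToken, now = Date.now) {
  let cached = null; // { token, expiresAt }
  return async function getAccessToken() {
    // Refresh 60s early so a nearly expired token is never used mid-request
    if (cached && cached.expiresAt - 60_000 > now()) return cached.token;
    const { access_token, expires_in } = await fetchToken();
    cached = { token: access_token, expiresAt: now() + expires_in * 1000 };
    return access_token;
  };
}
```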
Parsing VTT Transcripts
Zoom generates WebVTT files for cloud recordings. They look like this:
WEBVTT
00:00:01.000 --> 00:00:05.000
<v Dan Smith> Welcome everyone to the weekly meeting
00:00:06.000 --> 00:00:10.000
<v Manuel Porras> Thanks Dan, let's get started

The <v Speaker Name> tags are gold — they tell you who said what. My parser extracts timestamps, speaker names, and text into a clean format:
[00:00:01.000] Dan Smith: Welcome everyone to the weekly meeting
[00:00:06.000] Manuel Porras: Thanks Dan, let's get started

This speaker-labeled format is critical for the AI extraction step. Without it, Gemini can't tell who committed to what.
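A minimal version of that parser is a single regex pass over the cues. This sketch assumes the cue shape shown above (start timestamp plus a <v Speaker> tag); production VTT files can carry extra cue metadata that a fuller parser would handle:

```javascript
// Minimal VTT parsing sketch: pull the start timestamp, <v Speaker>
// tag, and text out of each cue and emit speaker-labeled lines.
function parseVtt(vtt) {
  const lines = [];
  const cueRe =
    /(\d{2}:\d{2}:\d{2}\.\d{3}) --> [\d:.]+\s*\n<v ([^>]+)>\s?([^\n]*)/g;
  for (const [, start, speaker, text] of vtt.matchAll(cueRe)) {
    lines.push(`[${start}] ${speaker}: ${text.trim()}`);
  }
  return lines.join('\n');
}
```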
Client Matching
With 30 clients and meetings with varied naming conventions, I needed a way to automatically match a meeting topic like "Northern Services — Weekly Huddle" to the right client config.
The approach is simple: a keyword rule set. Each client has a list of keywords, and the matcher checks if all tokens of any keyword appear in the meeting topic. Normalization handles accents, casing, and punctuation. The first match wins. Unmatched meetings go to a triage channel for manual review.
This isn't fancy NLP — it's a lookup table. But it works reliably for 30 clients and hasn't needed a single correction since deployment. Sometimes the boring solution is the right one.
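The matcher fits in a few lines. The rule shape here ({ client, keywords }) is an assumption for illustration, not the production config format, but the logic follows the description above: normalize, tokenize, first match wins, null means triage:

```javascript
// Keyword-rule matcher sketch: normalize accents/casing/punctuation,
// then return the first client whose keyword has all of its tokens
// present in the meeting topic. Rule shape is an assumed example.
function normalize(s) {
  return s
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '') // strip accents
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' '); // punctuation -> spaces
}

function matchClient(topic, rules) {
  const topicTokens = new Set(normalize(topic).split(/\s+/).filter(Boolean));
  for (const rule of rules) {
    for (const keyword of rule.keywords) {
      const tokens = normalize(keyword).split(/\s+/).filter(Boolean);
      if (tokens.every((t) => topicTokens.has(t))) return rule.client;
    }
  }
  return null; // unmatched -> triage channel
}
```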
AI Extraction with Gemini
This is where the magic happens. Gemini 2.0 Flash takes the parsed transcript (often 40,000+ characters) and returns structured JSON:
// genAI is a GoogleGenerativeAI client created once at startup
const model = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  generationConfig: {
    temperature: 0.2,
    maxOutputTokens: 4000,
    responseMimeType: 'application/json',
  },
});

The responseMimeType: 'application/json' is the key setting. It forces Gemini to return valid JSON every time — no markdown wrapping, no explanation text, just the structured data.
The prompt asks for a summary, attendee list, action items (with owner, due date, priority), decisions with context, and follow-up references.
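Even with JSON output enforced, it's worth normalizing the response shape before posting to Slack. The field names here (summary, actionItems, decisions) mirror the prompt described above but are an assumed schema, not a documented one:

```javascript
// Sketch of normalizing Gemini's JSON response. Field names are an
// assumed shape matching the prompt; missing fields fall back to
// safe defaults so the Slack formatter never sees undefined.
function parseExtraction(jsonText) {
  const data = JSON.parse(jsonText);
  return {
    summary: typeof data.summary === 'string' ? data.summary : '',
    actionItems: Array.isArray(data.actionItems) ? data.actionItems : [],
    decisions: Array.isArray(data.decisions) ? data.decisions : [],
  };
}
```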
From a real 78-minute meeting transcript (54,000 characters), Gemini extracted 5 action items and 3 decisions in about 6 seconds. Cost: effectively $0 on the Flash free tier.
The quality surprised me. It correctly attributed action items to specific people based on who volunteered in the conversation, not just who was mentioned. And it only included items that were explicitly committed to — not hypotheticals or suggestions.
The Slack Output
The formatted message hits the client's Slack channel within 15 minutes of the call ending:
Meeting Notes: Northern Services — Weekly (Mar 23, 2026)
Dan, Cathy, Manuel
SUMMARY
Reviewed Q1 performance. Approved spring landing page.
Increasing Google Ads budget 20%.
ACTION ITEMS
- Dan — Update GBP with spring hours (due: Mar 25)
- Manuel — Create spring landing page draft (due: Mar 28)
- Cathy — Send updated service area list (due: Mar 26)
DECISIONS
- Approved new landing page design for spring campaign
- Will increase Google Ads budget by 20% starting April

Clean, scannable, actionable. No one has to read a 54,000-character transcript to know what happened.
Dedup and Reliability
The cron runs every 5 minutes, so dedup is essential. Each meeting's UUID gets stored in SQLite, and the pipeline checks before processing. If something fails mid-processing (Gemini timeout, Slack API error), the meeting gets marked as error with the message stored. On the next run, it won't re-process — but the error is visible for debugging.
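The state machine behind that is small. The real pipeline keeps it in SQLite; this sketch uses a Map so the logic stands alone: a meeting is processed once, and a failure is recorded rather than retried:

```javascript
// Dedup/status sketch. SQLite backs this in the real pipeline; a Map
// stands in here. A meeting seen once (done OR errored) is never
// re-processed, but error details stay visible for debugging.
function makeMeetingStore() {
  const rows = new Map(); // uuid -> { status, error }
  return {
    shouldProcess(uuid) {
      return !rows.has(uuid); // skip done and errored meetings alike
    },
    markDone(uuid) {
      rows.set(uuid, { status: 'done', error: null });
    },
    markError(uuid, message) {
      rows.set(uuid, { status: 'error', error: message });
    },
    get(uuid) {
      return rows.get(uuid) ?? null;
    },
  };
}
```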
Error alerts go to a dedicated Slack channel so the team knows immediately if something breaks.
What I Learned
VTT parsing is underrated. The speaker attribution in Zoom's VTT files is surprisingly reliable. Building a good parser for this format gave the entire pipeline its foundation.
Structured JSON output from LLMs is a game changer. Before responseMimeType: 'application/json', you'd pray the model returned parseable output. Now it's guaranteed. This single feature makes LLMs viable for production data extraction.
Simple client matching beats ML. A 30-entry lookup table with keyword rules has been 100% accurate. No training data needed, no model maintenance, no false positives.
The 15-minute SLA matters. The previous workflow had a 2-day lag, not because the technology was slow, but because it required a human to read transcripts and write summaries. Removing the human from the loop turned a 2-day process into a 15-minute one.
The Numbers
- 30+ meetings processed per week across 4 Zoom users
- 15-minute turnaround from call end to Slack post (was 2 days)
- ~6 seconds for AI extraction per meeting
- $0 cost for AI processing (Gemini Flash free tier)
- Zero false matches on client identification
- One day to build and deploy the entire pipeline
The system has been running autonomously for weeks now. No babysitting, no manual intervention, no missed meetings.
If your team is drowning in meeting follow-ups, this kind of pipeline is surprisingly straightforward to build. The components are all available — Zoom API, VTT parsing, structured LLM extraction, Slack API. The hard part isn't the technology. It's deciding to automate it.