April 16, 2026
How I Automated Meeting Action Items with Zoom and Gemini AI
Turning a 2-day delay into a 15-minute pipeline
A few weeks ago, a CEO I work with told me something frustrating: "We have 30 client meetings a week, and nobody knows what was decided until two days later."
The problem wasn't that meetings weren't happening. It was that action items lived in people's heads until someone remembered to write them down. By then, details were fuzzy, deadlines were missed, and clients noticed.
The existing setup? An N8N workflow that dumped raw Zoom transcripts into growing text files on Google Drive. Nobody read them. Thousands of words per meeting, unstructured, unsearchable.
I built a replacement in a day. Here's how.
The Architecture
The system is simple in concept:
- A cron job runs every 5 minutes
- It polls the Zoom API for new recordings with transcripts
- It matches each meeting to a client using keyword rules
- It downloads the VTT transcript and parses it into speaker-labeled text
- Gemini AI extracts action items, decisions, and a summary
- A formatted message goes to the client's Slack channel
- Everything gets stored in SQLite for dedup and history
Zoom Call Ends (~5 min processing)
|
v
[Cron every 5 min] poll.js
|-- Zoom API: list new recordings
|-- Match topic -> client (keyword rules)
|-- Download VTT transcript
|-- Parse VTT -> speaker-labeled text
|-- Gemini: extract structured data (JSON)
|-- Post to Slack
'-- Store in SQLite (dedup)

The whole pipeline runs autonomously. No human intervention needed.
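The steps above can be sketched as a single poll function. The step names here (fetchNewRecordings, matchClient, and so on) are hypothetical stand-ins for the stages in the diagram, injected as dependencies so the flow itself is easy to test:

```javascript
// Sketch of the poll.js entry point. Each stage is a hypothetical
// injected function; the real implementations are described in the
// sections below.
async function pollOnce(steps) {
  const results = [];
  const meetings = await steps.fetchNewRecordings();
  for (const meeting of meetings) {
    // Dedup: skip anything already handled on a previous run
    if (await steps.alreadyProcessed(meeting.uuid)) continue;
    const client = steps.matchClient(meeting.topic);
    const transcript = await steps.downloadTranscript(meeting);
    const extracted = await steps.extract(transcript);
    await steps.postToSlack(client, extracted);
    await steps.markProcessed(meeting.uuid);
    results.push({ uuid: meeting.uuid, client });
  }
  return results;
}
```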
Zoom Server-to-Server OAuth
The first challenge was authentication. Zoom's API uses Server-to-Server OAuth for automated access — no user login required, which is exactly what a cron job needs.
The tricky part: S2S apps can't use users/me to list recordings. You need to query each user individually. The account had 4 users, so the recording discovery function loops through all of them:
export async function listRecordings(lookbackHours = 24) {
  const token = await getAccessToken();
  const users = await listUsers();
  // Zoom expects yyyy-mm-dd values for the from/to recording range
  const to = new Date().toISOString().slice(0, 10);
  const from = new Date(Date.now() - lookbackHours * 3600 * 1000)
    .toISOString().slice(0, 10);
  const allMeetings = [];
  for (const user of users) {
    const url = `https://api.zoom.us/v2/users/${user.id}/recordings` +
      `?from=${from}&to=${to}&page_size=100`;
    const res = await fetch(url, {
      headers: { 'Authorization': `Bearer ${token}` },
    });
    const data = await res.json();
    allMeetings.push(...(data.meetings || []));
  }
  return allMeetings;
}

Token caching matters here. The access token lasts an hour, and we're polling every 5 minutes, so there's no need to re-authenticate on every run.
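The caching itself can be as small as a closure that holds the token until shortly before it expires. This is a sketch; `fetchToken` is a hypothetical stand-in for the actual Zoom S2S OAuth request, and the clock is injectable so the logic is testable:

```javascript
// Token cache sketch: reuse the access token until shortly before
// expiry instead of re-authenticating on every 5-minute poll.
// `fetchToken` is a hypothetical stand-in for the Zoom OAuth call.
function makeTokenCache(fetchToken, now = Date.now) {
  let cached = null; // { token, expiresAt }
  return async function getAccessToken() {
    // Refresh 60s early so a nearly expired token is never used mid-request
    if (cached && cached.expiresAt - 60_000 > now()) return cached.token;
    const { access_token, expires_in } = await fetchToken();
    cached = { token: access_token, expiresAt: now() + expires_in * 1000 };
    return access_token;
  };
}
```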
Parsing VTT Transcripts
Zoom generates WebVTT files for cloud recordings. They look like this:
WEBVTT
00:00:01.000 --> 00:00:05.000
<v Dan Smith> Welcome everyone to the weekly meeting
00:00:06.000 --> 00:00:10.000
<v Manuel Porras> Thanks Dan, let's get started

The <v Speaker Name> tags are gold — they tell you who said what. My parser extracts timestamps, speaker names, and text into a clean format:
[00:00:01.000] Dan Smith: Welcome everyone to the weekly meeting
[00:00:06.000] Manuel Porras: Thanks Dan, let's get started

This speaker-labeled format is critical for the AI extraction step. Without it, Gemini can't tell who committed to what.
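A minimal version of that parser is a single regex pass over the cues. This sketch assumes the cue shape shown above (start timestamp plus a <v Speaker> tag); production VTT files can carry extra cue metadata that a fuller parser would handle:

```javascript
// Minimal VTT parsing sketch: pull the start timestamp, <v Speaker>
// tag, and text out of each cue and emit speaker-labeled lines.
function parseVtt(vtt) {
  const lines = [];
  const cueRe =
    /(\d{2}:\d{2}:\d{2}\.\d{3}) --> [\d:.]+\s*\n<v ([^>]+)>\s?([^\n]*)/g;
  for (const [, start, speaker, text] of vtt.matchAll(cueRe)) {
    lines.push(`[${start}] ${speaker}: ${text.trim()}`);
  }
  return lines.join('\n');
}
```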
Client Matching
With 30 clients and meetings with varied naming conventions, I needed a way to automatically match a meeting topic like "Northern Services — Weekly Huddle" to the right client config.
The approach is simple: a keyword rule set. Each client has a list of keywords, and the matcher checks if all tokens of any keyword appear in the meeting topic. Normalization handles accents, casing, and punctuation. The first match wins. Unmatched meetings go to a triage channel for manual review.
This isn't fancy NLP — it's a lookup table. But it works reliably for 30 clients and hasn't needed a single correction since deployment. Sometimes the boring solution is the right one.
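The matcher fits in a few lines. The rule shape here ({ client, keywords }) is an assumption for illustration, not the production config format, but the logic follows the description above: normalize, tokenize, first match wins, null means triage:

```javascript
// Keyword-rule matcher sketch: normalize accents/casing/punctuation,
// then return the first client whose keyword has all of its tokens
// present in the meeting topic. Rule shape is an assumed example.
function normalize(s) {
  return s
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '') // strip accents
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' '); // punctuation -> spaces
}

function matchClient(topic, rules) {
  const topicTokens = new Set(normalize(topic).split(/\s+/).filter(Boolean));
  for (const rule of rules) {
    for (const keyword of rule.keywords) {
      const tokens = normalize(keyword).split(/\s+/).filter(Boolean);
      if (tokens.every((t) => topicTokens.has(t))) return rule.client;
    }
  }
  return null; // unmatched -> triage channel
}
```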
AI Extraction with Gemini
This is where the magic happens. Gemini 2.0 Flash takes the parsed transcript (often 40,000+ characters) and returns structured JSON:
// genAI is a GoogleGenerativeAI client created once at startup
const model = genAI.getGenerativeModel({
  model: 'gemini-2.0-flash',
  generationConfig: {
    temperature: 0.2,
    maxOutputTokens: 4000,
    responseMimeType: 'application/json',
  },
});

The responseMimeType: 'application/json' is the key setting. It forces Gemini to return valid JSON every time — no markdown wrapping, no explanation text, just the structured data.
The prompt asks for a summary, attendee list, action items (with owner, due date, priority), decisions with context, and follow-up references.
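Even with JSON output enforced, it's worth normalizing the response shape before posting to Slack. The field names here (summary, actionItems, decisions) mirror the prompt described above but are an assumed schema, not a documented one:

```javascript
// Sketch of normalizing Gemini's JSON response. Field names are an
// assumed shape matching the prompt; missing fields fall back to
// safe defaults so the Slack formatter never sees undefined.
function parseExtraction(jsonText) {
  const data = JSON.parse(jsonText);
  return {
    summary: typeof data.summary === 'string' ? data.summary : '',
    actionItems: Array.isArray(data.actionItems) ? data.actionItems : [],
    decisions: Array.isArray(data.decisions) ? data.decisions : [],
  };
}
```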
From a real 78-minute meeting transcript (54,000 characters), Gemini extracted 5 action items and 3 decisions in about 6 seconds. Cost: effectively $0 on the Flash free tier.
The quality surprised me. It correctly attributed action items to specific people based on who volunteered in the conversation, not just who was mentioned. And it only included items that were explicitly committed to — not hypotheticals or suggestions.
The Slack Output
The formatted message hits the client's Slack channel within 15 minutes of the call ending:
Meeting Notes: Northern Services — Weekly (Mar 23, 2026)
Dan, Cathy, Manuel
SUMMARY
Reviewed Q1 performance. Approved spring landing page.
Increasing Google Ads budget 20%.
ACTION ITEMS
- Dan — Update GBP with spring hours (due: Mar 25)
- Manuel — Create spring landing page draft (due: Mar 28)
- Cathy — Send updated service area list (due: Mar 26)
DECISIONS
- Approved new landing page design for spring campaign
- Will increase Google Ads budget by 20% starting April

Clean, scannable, actionable. No one has to read a 54,000-character transcript to know what happened.
Dedup and Reliability
The cron runs every 5 minutes, so dedup is essential. Each meeting's UUID gets stored in SQLite, and the pipeline checks before processing. If something fails mid-processing (Gemini timeout, Slack API error), the meeting gets marked as error with the message stored. On the next run, it won't re-process — but the error is visible for debugging.
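The state machine behind that is small. The real pipeline keeps it in SQLite; this sketch uses a Map so the logic stands alone: a meeting is processed once, and a failure is recorded rather than retried:

```javascript
// Dedup/status sketch. SQLite backs this in the real pipeline; a Map
// stands in here. A meeting seen once (done OR errored) is never
// re-processed, but error details stay visible for debugging.
function makeMeetingStore() {
  const rows = new Map(); // uuid -> { status, error }
  return {
    shouldProcess(uuid) {
      return !rows.has(uuid); // skip done and errored meetings alike
    },
    markDone(uuid) {
      rows.set(uuid, { status: 'done', error: null });
    },
    markError(uuid, message) {
      rows.set(uuid, { status: 'error', error: message });
    },
    get(uuid) {
      return rows.get(uuid) ?? null;
    },
  };
}
```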
Error alerts go to a dedicated Slack channel so the team knows immediately if something breaks.
What I Learned
VTT parsing is underrated. The speaker attribution in Zoom's VTT files is surprisingly reliable. Building a good parser for this format gave the entire pipeline its foundation.
Structured JSON output from LLMs is a game changer. Before responseMimeType: 'application/json', you'd pray the model returned parseable output. Now it's guaranteed. This single feature makes LLMs viable for production data extraction.
Simple client matching beats ML. A 30-entry lookup table with keyword rules has been 100% accurate. No training data needed, no model maintenance, no false positives.
The 15-minute SLA matters. The previous workflow had a 2-day lag, not because the technology was slow, but because it required a human to read transcripts and write summaries. Removing the human from the loop turned a 2-day process into a 15-minute one.
The Numbers
- 30+ meetings processed per week across 4 Zoom users
- 15-minute turnaround from call end to Slack post (was 2 days)
- ~6 seconds for AI extraction per meeting
- $0 cost for AI processing (Gemini Flash free tier)
- Zero false matches on client identification
- One day to build and deploy the entire pipeline
The system has been running autonomously for weeks now. No babysitting, no manual intervention, no missed meetings.
If your team is drowning in meeting follow-ups, this kind of pipeline is surprisingly straightforward to build. The components are all available — Zoom API, VTT parsing, structured LLM extraction, Slack API. The hard part isn't the technology. It's deciding to automate it.