My custom agent used 87% fewer tokens when I gave it Skills for its MCP tools

Today’s web apps don’t seem particularly concerned about resource consumption. The simplest site seems to eat up hundreds of MB of memory in my browser. We’ve probably gotten a bit lazy with optimization since many computers have horsepower to spare. But when it comes to LLM tokens, we’re still judicious. Most of us have bumped into quotas or unexpected costs!

I see many examples of introducing and tuning MCPs and skills for IDEs and agentic tools. But what about the agents you’re building? What’s the token impact of using MCPs and skills for custom agents?

I tried out six solutions with the Agent Development Kit (Python) (https://github.com/google/adk-python) and counted my token consumption for each. The tl;dr? A well-prompted Gemini with zero tools or skills succeeded with the fewest tokens consumed, with the second-best option being a single MCP plus a skill. Third-best in token consumption was the BigQuery MCP on its own.

I trust that you can find a thousand ways to do this better than me, but here’s a table with the best results from multiple runs of each of my experiments. The title of the post refers to the difference between scenarios 2 and 3.

Scenario | Agent Description | Turns | Tokens
0 | Instructions only, built-in code execution tool | 7 | 1,286
1 | Uses BigQuery MCP | 9 | 13,763
2 | Uses BigQuery, AlloyDB, Cloud SQL MCPs | 29 | 328,083
3 | Uses BigQuery, AlloyDB, Cloud SQL MCPs with skill | 5 | 39,622
4 | Uses BigQuery MCP and a skill | 5 | 6,653
5 | Instruction, skill, and built-in code execution tool | 27 | 64,444

What’s the problem to solve?

I want an agent that can do some basic cloud FinOps for me. I’ve got a Google Cloud BigQuery table that is automatically populated with billing data for items in my project.

Let’s have an agent that can find the table and figure out what my most expensive Cloud Storage buckets are so far this month. This could be an agent we call from a platform like Gemini Enterprise (https://cloud.google.com/gemini-enterprise) so that our finance people (or team leads) could quickly get billing info.
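
For context, a successful run ultimately needs to land on something like the query below. This is my hand-written sketch, not agent output: the table name is a placeholder, and I’m assuming the schema of the detailed (resource-level) billing export, where buckets appear under resource.name.

from google.cloud import bigquery

# Placeholder table name; a real billing export has a generated suffix.
BILLING_TABLE = 'seroter-project-base.billing.gcp_billing_export_resource_v1_XXXXXX'

# Assumes the detailed billing export schema (service.description,
# resource.name, cost, usage_start_time).
query = f"""
SELECT
  resource.name AS bucket,
  SUM(cost) AS total_cost
FROM `{BILLING_TABLE}`
WHERE service.description = 'Cloud Storage'
  AND usage_start_time >= TIMESTAMP('2026-03-01')
  AND usage_start_time < TIMESTAMP('2026-04-01')
GROUP BY bucket
ORDER BY total_cost DESC
LIMIT 3
"""

client = bigquery.Client(project='seroter-project-base')
for row in client.query(query).result():
  print(row.bucket, row.total_cost)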

A look at our agent runner

The Agent Development Kit (ADK) offers some powerful features for building robust agents. It has native support for MCPs and skills, and has built-in tools for services like Google Search (https://google.github.io/adk-docs/integrations/google-search/).

While the ADK does have a built-in BigQuery tool (https://google.github.io/adk-docs/integrations/bigquery/), I wanted to use the various managed MCP servers (https://docs.cloud.google.com/mcp/supported-products) Google Cloud offers.
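
I won’t paste every agent variation in this post, but wiring one of those managed servers into an agent looks roughly like the sketch below. Treat it as a sketch: the endpoint URL is a placeholder (grab the real one from the supported-products page), and I’m assuming a standard OAuth bearer token in the request headers.

import google.auth
import google.auth.transport.requests
from google.adk.tools.mcp_tool import McpToolset, StreamableHTTPConnectionParams

# Use application default credentials to mint an access token.
credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform']
)
credentials.refresh(google.auth.transport.requests.Request())

# Placeholder URL: substitute the managed BigQuery MCP endpoint from the docs.
bigquery_mcp = McpToolset(
    connection_params=StreamableHTTPConnectionParams(
        url='https://<managed-bigquery-mcp-endpoint>/mcp',
        headers={'Authorization': f'Bearer {credentials.token}'},
    )
)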

Let’s look at some code. One file to start. The main.py file runs our agent and counts the tokens from each turn of the LLM. The token counting magic was snagged from an existing sample app (https://github.com/google/adk-python/blob/main/contributing/samples/token_usage/main.py). For production scenarios, you might want to use our BigQuery Agent Analytics plugin for ADK, which captures a ton of interesting data points about your agent runs, including tokens per turn.

Here’s the main.py file:

import asyncio
import os
import time
import warnings

import agent
from dotenv import load_dotenv
from google.adk import Runner
from google.adk.agents.run_config import RunConfig
from google.adk.artifacts.in_memory_artifact_service import InMemoryArtifactService
from google.adk.cli.utils import logs
from google.adk.sessions.in_memory_session_service import InMemorySessionService
from google.adk.sessions.session import Session
from google.genai import types

# --- Initialization & Configuration ---

# Load environment variables (like API keys) from the .env file
load_dotenv(os.path.join(os.path.dirname(__file__), '.env'), override=True)

# Suppress experimental warnings from the ADK
warnings.filterwarnings('ignore', category=UserWarning)

# Redirect agent framework logs to a temporary folder
logs.log_to_tmp_folder()


async def main():
  app_name = 'my_app'
  user_id_1 = 'user1'

  # Initialize the services required to manage chat history and created artifacts
  session_service = InMemorySessionService()
  artifact_service = InMemoryArtifactService()

  # The Runner orchestrates the agent's execution loop
  runner = Runner(
      app_name=app_name,
      agent=agent.root_agent,
      artifact_service=artifact_service,
      session_service=session_service,
  )

  # Create a new session to hold the conversation state
  session_1 = await session_service.create_session(
      app_name=app_name, user_id=user_id_1
  )

  total_prompt_tokens = 0
  total_candidate_tokens = 0
  total_tokens = 0
  total_turns = 0

  async def run_prompt(session: Session, new_message: str):
    # Helper variables to track token usage and turns across the session
    nonlocal total_prompt_tokens
    nonlocal total_candidate_tokens
    nonlocal total_tokens
    nonlocal total_turns

    # Structure the user's string input into the appropriate Content format
    content = types.Content(
        role='user', parts=[types.Part.from_text(text=new_message)]
    )
    print('** User says:', content.model_dump(exclude_none=True))

    # Stream events back from the Runner as the agent executes its task
    async for event in runner.run_async(
        user_id=user_id_1,
        session_id=session.id,
        new_message=content,
    ):
      total_turns += 1

      # Print intermediate steps (text, tool calls, and tool responses) to the console
      if event.content and event.content.parts:
        for part in event.content.parts:
          if part.text:
            print(f'** {event.author}: {part.text}')
          if part.function_call:
            print(f'** {event.author} calls tool: {part.function_call.name}')
            print(f'   Arguments: {part.function_call.args}')
          if part.function_response:
            print(f'** Tool response from {part.function_response.name}:')
            print(f'   Response: {part.function_response.response}')

      if event.usage_metadata:
        total_prompt_tokens += event.usage_metadata.prompt_token_count or 0
        total_candidate_tokens += (
            event.usage_metadata.candidates_token_count or 0
        )
        total_tokens += event.usage_metadata.total_token_count or 0
        print(
            f'Turn tokens: {event.usage_metadata.total_token_count}'
            f' (prompt={event.usage_metadata.prompt_token_count},'
            f' candidates={event.usage_metadata.candidates_token_count})'
        )

    print(
        f'Session tokens: {total_tokens} (prompt={total_prompt_tokens},'
        f' candidates={total_candidate_tokens})'
    )

  # --- Execution Phase ---

  start_time = time.time()
  print('Start time:', start_time)
  print('------------------------------------')

  # Send the initial prompt to the agent and trigger the run loop
  await run_prompt(
      session_1,
      'Find the top 3 most expensive Cloud Storage buckets in our March 2026'
      ' billing export for project seroter-project-base',
  )
  print(
      await artifact_service.list_artifact_keys(
          app_name=app_name, user_id=user_id_1, session_id=session_1.id
      )
  )

  end_time = time.time()
  print('------------------------------------')
  print('Total turns:', total_turns)
  print('End time:', end_time)
  print('Total time:', end_time - start_time)


if __name__ == '__main__':
  asyncio.run(main())

Nothing too shocking here. But this gives me a fairly verbose output that lets me see how many turns and tokens each scenario eats up.

Scenario 0: Raw agent (no MCP, no external tools) using Python code execution

In this foundational test, what if we ask the agent to answer the question without the help of any external tools? All it can do is write and execute Python code on the local machine using a built-in tool. This flavor is only for local dev, as there are more production-grade isolation options (https://google.github.io/adk-docs/integrations/gke-code-executor/) for running code.

Here’s the agent.py for this base scenario. I’ve got a decent set of instructions to guide the agent for how to write code to find and query the relevant table.

from google.adk.agents import LlmAgent
from google.adk.skills import load_skill_from_dir
from google.adk.tools import skill_toolset
from google.adk.tools.mcp_tool import McpToolset, StreamableHTTPConnectionParams
from google.adk.auth.auth_credential import AuthCredential, AuthCredentialTypes, ServiceAccount
from fastapi.openapi.models import OAuth2, OAuthFlows, OAuthFlowClientCredentials
from google.adk.code_executors.unsafe_local_code_executor import UnsafeLocalCodeExecutor

— Agent Definition —

— Scenario 0: Raw Agent using Python Code Execution for Discovery and Analysis —

root_agent = LlmAgent(
    name="data_analyst_agent",
    model="gemini-3.1-flash-lite-preview",
    instruction="""You are a data analyst. CRITICAL: You have NO TOOLS registered.
NEVER attempt a tool call or function call (like list_datasets or bq_list_dataset_ids).
You MUST perform all technical tasks by writing and executing Python code blocks
in markdown format (e.g., ```python). Markdown SQL blocks (```sql) will NOT execute.""",
    code_executor=UnsafeLocalCodeExecutor(),
)
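
When this worked, the generated code usually started with discovery before any SQL. Here’s a paraphrased example of the kind of code the agent wrote (my reconstruction, not captured output; the table-matching heuristic is illustrative):

from google.cloud import bigquery

client = bigquery.Client(project='seroter-project-base')

# Walk the project's datasets and tables, looking for something that
# resembles a billing export before writing SQL against it.
for dataset in client.list_datasets():
  for table in client.list_tables(dataset.dataset_id):
    if 'billing' in table.table_id.lower():
      print(f'Candidate: {dataset.dataset_id}.{table.table_id}')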

I saw a fair bit of variability in the responses here, with my last run coming in at 23 seconds, 27 turns, and 64,444 session tokens. In prior runs, I had as many as 35 turns and 107,980 tokens. I asked my coding tool to explain this, and it made some good points: this scenario took extra turns to load skills, write code, and run code, and all that generated code ate up tokens.

Takeaways

This was fun. I’m sure you can do better, so please tell me how you’d improve on my tests. Some things to consider:

• Model choice matters. I had very different results as I navigated different Gemini models. Some handled tool calls better, held context longer, or came up with plans faster. You’d probably see unique results by using Claude or GPT models too.

• MCPs are better with skills. MCP alone left the agent iterating on a plan of attack, which meant more turns and tokens. A super-focused skill resulted in a very targeted use of MCP that was even more efficient than a code-only approach (see the skill sketch after this list).

• Instructions make a difference. Maybe the above won’t hold true with an even better prompt. And I was a bit contrived in a few examples, forcing the agent to discover the right BigQuery table instead of naming it outright. Good instructions can make a big impact on token usage.

• Agent frameworks give you many levers that impact token consumption. ADK is great, and is available for Java, JavaScript, Go, and Dart too. Become well aware of what built-in tools are available in your framework of choice, and how your various decisions determine how many tokens you eat.

• Make token consumption visible. Not every tool or framework makes it obvious how to count up token use. Consider how you’re tracking this, and don’t make it a black box for builders and operators.
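
Since skills did so much of the heavy lifting, here’s the shape of the one I used. A skill is just a folder containing a SKILL.md file: YAML frontmatter with a name and description, followed by focused instructions. The wording below is a simplified, illustrative version rather than my exact file.

---
name: billing-analysis
description: Find and query the Cloud Billing export table in BigQuery to answer cost questions.
---

# Billing analysis

1. Use only the BigQuery tools. Do not touch other data sources.
2. Find the billing export by listing datasets and looking for a table whose name starts with gcp_billing_export.
3. Filter on service.description and usage_start_time, then aggregate SUM(cost).
4. Return a short, sorted list of results. Do not run exploratory queries beyond this.

In agent.py, the load_skill_from_dir import you saw earlier points at that folder, so the agent can see the skill’s one-line description cheaply and only pull in the full instructions when they’re relevant.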

Feedback? Other scenarios I should have tried? Let me know.
