Non-Brand Data

What Real SQL Work Taught Me About Being a Data Scientist

Cornellius Yudha Wijaya — Sat, 28 Mar 2026 15:07:48 GMT

Image by Ideogram.ai

"Real SQL work taught me that trustworthy definitions matter more than flashy queries."

I did not start by taking SQL seriously

Early in my career, I did not see SQL as central to being a data scientist. Most of my learning was built around Python, and the classes and bootcamps I joined reinforced that view. Python felt like the real language of data science. SQL felt useful, but distant.

I did not reject it. I simply did not get enough exposure to it.

That distinction matters. When your early learning path is dominated by notebooks, models, and Python libraries, it is easy to assume that the real work starts once the data is already in front of you. In that worldview, SQL looks like preparation work. Helpful, yes. Foundational, no.

Real work changed that view gradually.

The more I worked in corporate settings, the clearer it became that many projects do not begin with modeling, dashboards, or machine learning. They begin with a more basic set of questions: Is the data available? Is the definition correct? Can the result be trusted enough for someone to act on it?

Work forced the lesson

What changed my view of SQL was not one dramatic moment. It was the accumulation of projects. Again and again, the work pulled me toward the same reality: before anything becomes an analysis, a model, or a recommendation, someone has to make sure the data is available, correctly defined, and usable.

That is where SQL kept appearing.

Sometimes the request looked simple. A business team needed a report. Sometimes the request sounded more strategic. A project needed insight to inform a decision. Sometimes the work moved beyond a single analysis into the project's production life. In each case, SQL mattered not only for retrieving data, but for deciding whether the project itself rested on a solid foundation.

The difficult conversations were often not about syntax at all. They were about meaning. What exactly should count as a sale? Which time window should be used? Which source should be treated as the source of truth? If two tables produce different answers, which one reflects the real business process?

That was the point where SQL stopped feeling like a supporting skill and became infrastructure.

What real SQL work actually looked like

The lesson became clearer through a few recurring types of work. None of them were glamorous; they were simply the places where SQL kept proving its value.

Ad-hoc reporting and insight requests that looked simple but hid messy logic and scattered data.
Metric definition work, where the challenge was deciding what should count before writing the query.
Combining multiple data sources without destroying the business meaning of the result.
Preparing the right data for downstream analysis and modeling in Python.

1. Ad-hoc reporting taught me that simple requests are rarely simple

A lot of real SQL work starts with a seemingly harmless request. The business needs a report. Someone wants a quick performance update. A team asks for insight before a meeting. On paper, it sounds like a straightforward query.

In practice, it rarely is.

Sometimes the data is not available in one place. Sometimes it lives across several sources that were never designed to fit together neatly. Sometimes the logic needed to answer the question is more complicated than the request suggests. And often the timeline is short, so you do not have the luxury of slowly wading through the data.

That changed how I think about SQL skills. In real reporting work, the challenge is not just writing something that runs. The challenge is moving from a vague business question to a reliable answer under real constraints. That takes judgment, prioritization, and a clear sense of what the output needs to mean.

Useful SQL work is often less glamorous than people expect. It is not always about elegant tricks. Very often, it is about getting the right answer quickly enough to matter, without breaking the logic behind it.

2. Metric definition matters more than query complexity

If there is one area where real SQL work changed me the most, it is the definition of metrics.

In theory, a metric looks clean. In practice, even something as familiar as a sales number can go wrong depending on the time scope, exclusions, business rules, and source tables. A number can look precise and still be misleading if two teams are working from different assumptions or if one table captures the event differently from another.

That is why some SQL problems cannot be solved by clever syntax alone. You can write a technically correct query and still produce the wrong business answer.

The real work is often more basic and more demanding at the same time:

deciding what should count
deciding what should be excluded
choosing which table reflects the operational truth
making sure the result matches the way the business actually works
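
To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and every row are invented for illustration. Both queries run without error, yet they answer different questions, because they disagree on what counts as a sale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    order_date TEXT,
    status     TEXT,     -- 'paid', 'refunded', or 'cancelled' (hypothetical values)
    is_test    INTEGER,  -- 1 marks an internal test order
    amount     REAL
);
INSERT INTO orders VALUES
    (1, '2026-01-05', 'paid',      0, 100.0),
    (2, '2026-01-09', 'refunded',  0,  40.0),
    (3, '2026-01-12', 'cancelled', 0,  60.0),
    (4, '2026-01-20', 'paid',      1,  25.0),
    (5, '2026-02-02', 'paid',      0,  80.0);
""")

# Definition A: every order row placed in January counts as a sale.
gross_sales = conn.execute("""
    SELECT SUM(amount)
    FROM orders
    WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01'
""").fetchone()[0]

# Definition B: only paid, non-test orders placed in January count.
net_sales = conn.execute("""
    SELECT SUM(amount)
    FROM orders
    WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01'
      AND status = 'paid'
      AND is_test = 0
""").fetchone()[0]

print("Definition A:", gross_sales)
print("Definition B:", net_sales)
```

Neither number is wrong in a technical sense. Which one is the "January sales" figure is a business decision, and it has to be settled with the people who own the process before the query is written.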

This is where collaboration becomes essential. There are many situations where the data exists, but understanding it requires discussion with business users who know the process behind the records. Without that alignment, a query may return rows but not the truth.

Over time, I started to see that some of the most dangerous problems in data work are not computational. They are definitional. A wrong definition can quietly damage a project, mislead stakeholders, or erode trust in the team long before anyone notices the issue.

3. Combining data sources is harder than it looks

Another lesson real SQL work taught me is that combining information from multiple sources without losing meaning is much harder than it first appears.

From the outside, joins can look like a purely technical step. In practice, they can become one of the most delicate parts of a project. Sometimes a clean primary key does not exist. Sometimes the relationship is not direct. Sometimes aggregation is needed before two datasets can even be compared. And sometimes each source reflects a slightly different view of the same business concept.

That creates several risks at once: duplicate rows, dropped records, timing mismatches, and numbers that appear structurally valid but are conceptually incorrect.
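
A small sketch of two of these risks, again with Python's sqlite3 module and invented tables (orders and shipments are hypothetical). One order has two shipments and one has none, so a plain join both duplicates and drops rows. Aggregating the shipments to one row per order before joining is one common way to keep the totals intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders    VALUES (1, 100.0), (2, 50.0), (3, 30.0);
INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Ground truth: total order amount with no join involved.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Naive join: order 1 appears twice (two shipments), order 3 disappears (none).
naive_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

# Safer pattern: collapse shipments to one row per order, then LEFT JOIN
# so that orders without shipments are kept.
safe_total, shipment_count = conn.execute("""
    SELECT SUM(o.amount), SUM(COALESCE(s.shipment_count, 0))
    FROM orders o
    LEFT JOIN (
        SELECT order_id, COUNT(*) AS shipment_count
        FROM shipments
        GROUP BY order_id
    ) s ON s.order_id = o.order_id
""").fetchone()

print(true_total, naive_total, safe_total, shipment_count)
```

Comparing a joined total against the same total computed from the base table alone is a cheap check that catches this kind of silent inflation.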

This is why SQL work often requires more collaboration than people expect. To combine sources responsibly, you frequently need validation from multiple stakeholders. The challenge is not merely to make the query run. The challenge is to preserve validity.

For me, this was one of the clearest moments where SQL became inseparable from business understanding. Good SQL was not just about retrieval. It was about preserving meaning as it moved across systems.

4. Even Python-heavy data science often begins with SQL

Because my early learning path emphasized Python, I initially imagined that most serious data-science work would begin there. In reality, SQL was often necessary before I could even start proper work in Python.

If the data lived in a SQL database, then SQL was the gatekeeper. It was how I extracted the relevant population, selected the appropriate time window, assembled the required columns, and checked whether the data were suitable for the task ahead. Whether the next step was exploratory analysis, feature preparation, modeling, or evaluation, SQL was often the first step.
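
As an illustration of that hand-off, the sketch below uses sqlite3 with two invented tables (customers and transactions) to define a population, apply a time window, and return one feature row per customer for later Python work. The table names, columns, and dates are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
conn.executescript("""
CREATE TABLE customers    (customer_id INTEGER PRIMARY KEY, signup_date TEXT);
CREATE TABLE transactions (customer_id INTEGER, txn_date TEXT, amount REAL);
INSERT INTO customers VALUES
    (1, '2025-06-01'), (2, '2025-11-15'), (3, '2026-02-10'), (4, '2025-03-01');
INSERT INTO transactions VALUES
    (1, '2026-01-10', 20.0), (1, '2026-01-25', 30.0), (1, '2025-12-30', 99.0),
    (2, '2026-01-05', 10.0), (3, '2026-02-11', 5.0);
""")

window_start, window_end = "2026-01-01", "2026-02-01"

# Population: customers who signed up before the window opened.
# Window: transactions dated inside [window_start, window_end).
# The window filter sits in the ON clause so customers with no
# transactions in the window still appear, with zero-valued features.
query = """
    SELECT c.customer_id,
           COUNT(t.customer_id)         AS txn_count,
           COALESCE(SUM(t.amount), 0.0) AS total_amount
    FROM customers c
    LEFT JOIN transactions t
           ON t.customer_id = c.customer_id
          AND t.txn_date >= ? AND t.txn_date < ?
    WHERE c.signup_date < ?
    GROUP BY c.customer_id
    ORDER BY c.customer_id
"""
features = [dict(r) for r in conn.execute(query, (window_start, window_end, window_start))]

for row in features:
    print(row)
```

Moving the window filter from the ON clause to the WHERE clause would silently drop customer 4, which is exactly the kind of population error that downstream modeling would inherit.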

That changed how I think about the relationship between SQL and data science. SQL is not simply what happens before the interesting work. Very often, it is part of the interesting work.

If the population is wrong, the feature set is incomplete, or the definition is unstable, the downstream Python work inherits that weakness. In that sense, SQL does not sit beneath data science. It sits inside it.

What I value in SQL work now

Real work also changed how I evaluate SQL skills in others and in myself.

I still care about writing cleaner, more efficient queries, especially as data grows larger and execution speed matters. But that is no longer the first thing I look for.

What I value first is this:

1. Correctness. The wrong data can quietly damage an entire project.
2. Stakeholder trust. Data work only becomes valuable when other people believe the result is dependable.
3. Maintainability. Many projects do not end after a single request, so someone has to live with the logic later.

A strong SQL practitioner, in my view, is not simply someone who knows a large amount of syntax. It is someone who understands the data definition, knows how to acquire the data in the most reliable way, and can produce logic that remains useful beyond the moment it was written.

What I would tell aspiring data scientists now

If your learning path has focused mostly on Python, I would say this clearly: do not treat SQL as optional.

You do not need to memorize every feature of the language before doing meaningful work. Documentation exists, and syntax can be learned as needed. But you do need to understand why SQL matters. It matters because data projects depend on access to the right data, under the right definitions, with logic that can withstand real business use.

That is the part I wish I had understood earlier. SQL is not important because it looks technical. It is important because it sits close to the truth conditions of data work. It is where data availability is tested. It is where definitions get challenged. It is where numbers either become trustworthy or fall apart.

For me, that has become one of the clearest professional lessons of real data work. SQL is not the opposite of data science, nor is it a lower-level skill beneath it. In many organizations, SQL is one of the foundations that allows data science to be useful at all.

And if there is one line I would leave readers with, it is this: real SQL work taught me that trustworthy definitions matter more than flashy queries.

If you are learning SQL now, learn it through real use cases. Learn it through reporting, metric definition, source validation, and the kind of business questions that force you to care about correctness.

Best Stock Market data API in the AI Agent era

Cornellius Yudha Wijaya — Sat, 14 Mar 2026 07:49:27 GMT

Photo by Nicholas Cappello on Unsplash

The stock market data API landscape is changing. In the past, developers mostly evaluated providers on familiar dimensions: coverage, latency, pricing, documentation, and reliability. Those criteria still matter, but the rise of LLM-powered copilots, autonomous research workflows, and multi-agent financial systems has introduced a new requirement: how easily can a data provider plug into agentic software?

In that environment, the strongest providers are not just the ones with broad datasets. They are the ones that expose clean, structured interfaces that AI agents can query, reason over, and combine with downstream tools for analysis, monitoring, and decision support. Some vendors are already leaning into this shift with MCP servers and LLM-oriented resources. Others remain stronger as enterprise data backbones than as explicitly AI-native platforms.

In this article, we will look at the best stock market data APIs for the AI agent era. Curious?

Let’s get into it.

Alpha Vantage

Overview

Alpha Vantage is a widely used financial data platform that provides real-time and historical market data APIs for equities, options, forex, cryptocurrencies, and macroeconomic indicators. The platform is designed to support both individual developers and professional trading systems through a simple, developer-friendly interface and a large catalog of market datasets.

A distinguishing feature of Alpha Vantage is the breadth of its data coverage. The platform delivers real-time and historical financial market data through programmatic APIs and spreadsheet integrations, enabling developers to build trading dashboards, quantitative research pipelines, and automated trading tools on top of a unified data interface.

The API also provides a rich library of built-in analytics—including technical indicators and fundamental datasets—allowing users to retrieve both raw market data and higher-level financial signals without implementing complex calculations themselves. In practice, this makes Alpha Vantage a flexible backbone for applications ranging from educational projects and fintech prototypes to production trading systems and investment research platforms.
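
As a rough sketch of what retrieving raw data can look like, the snippet below builds a request URL for the daily time series function and parses a response in the shape the public documentation describes. The helper names are my own, the sample payload contains made-up prices for illustration only, and the parameters and rate limits should be checked against the official documentation before relying on this.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.alphavantage.co/query"


def build_daily_url(symbol, api_key):
    """Build a query URL for the TIME_SERIES_DAILY function."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_daily(symbol, api_key):
    """Call the API over HTTPS (not executed in this sketch)."""
    with urlopen(build_daily_url(symbol, api_key), timeout=10) as resp:
        return json.load(resp)


def parse_daily_closes(payload):
    """Return (date, close) pairs sorted by date from a daily payload."""
    series = payload.get("Time Series (Daily)", {})
    return sorted((day, float(bar["4. close"])) for day, bar in series.items())


# Illustrative payload with invented numbers, shaped like a daily response.
SAMPLE_PAYLOAD = {
    "Meta Data": {"2. Symbol": "IBM"},
    "Time Series (Daily)": {
        "2026-03-12": {"1. open": "101.00", "2. high": "103.50",
                       "3. low": "100.20", "4. close": "102.75",
                       "5. volume": "1200000"},
        "2026-03-11": {"1. open": "99.80", "2. high": "101.10",
                       "3. low": "99.50", "4. close": "100.10",
                       "5. volume": "950000"},
    },
}

url = build_daily_url("IBM", "demo")
closes = parse_daily_closes(SAMPLE_PAYLOAD)
print(url)
print(closes)
```

Numeric values arrive as strings in this response shape, so converting them explicitly, as parse_daily_closes does, avoids subtle bugs when an agent or script starts doing arithmetic on them.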

What makes it valuable in the AI Agent era?

Alpha Vantage has become particularly relevant in the emerging ecosystem of LLM-powered financial tools and autonomous AI agents, largely because it provides structured market data in formats that are easy for agents and models to access, reason over, and integrate into automated workflows.

1. Native integration with AI agent ecosystems via MCP
Alpha Vantage provides an official Model Context Protocol (MCP) server, enabling large language models and agent-based applications to directly access financial data through standardized tools. The MCP server allows AI assistants and development environments to query real-time and historical stock market data programmatically, turning the API into a plug-and-play data source for agentic systems.

2. Compatibility with multi-agent financial research systems
Modern agentic trading frameworks increasingly rely on structured financial APIs like Alpha Vantage as data sources. For example, the open-source TradingAgents framework simulates a professional trading firm using multiple LLM-powered agents (fundamental analysts, technical analysts, sentiment analysts, traders, and risk managers) that collaborate to analyze equities and make decisions. The framework uses the Alpha Vantage API as its core data backbone.

3. Documentation and developer assets optimized for machine consumption
Another advantage in the LLM era is the structure and accessibility of Alpha Vantage’s developer resources. The platform provides comprehensive API documentation, examples, and community libraries across many programming languages, making it straightforward for both humans and AI coding agents to integrate financial data pipelines. Because LLM-powered development tools rely heavily on structured documentation, well-defined API endpoints, and example code, this ecosystem of docs, SDKs, and README files makes Alpha Vantage particularly easy for AI systems to learn and use.

In short

Alpha Vantage’s combination of structured financial APIs, an MCP interface for AI agents, and extensive developer documentation positions it as a data infrastructure layer for the emerging generation of AI-powered trading tools, research agents, and autonomous financial analysis systems.

Tradier

Overview

Tradier is a brokerage-focused API platform that combines market data, account access, and trading functionality. Its public API supports real-time, delayed, and historical market data through both request/response endpoints and streaming interfaces, while also exposing brokerage capabilities such as account information, positions, orders, watchlists, and trade execution.

A key differentiator is that Tradier is not just a data API. It is part of a brokerage stack. That means developers can use it not only to retrieve quotes, options chains, time-and-sales data, and historical pricing, but also to connect agentic workflows directly to trading and portfolio actions. Tradier also supports HTTP and WebSocket streaming, which is useful when building systems that need fast updates rather than purely batch-style analysis.

Tradier’s market-data positioning is more U.S.-brokerage-centric than broad all-asset-class research platforms. Real-time data is available to Tradier Brokerage account holders for U.S. stocks and options, and delayed data follows the standard 15-minute model for non-real-time access. That makes Tradier particularly compelling for execution-oriented applications rather than for the widest possible global dataset footprint.

What makes it valuable in the AI Agent era?

1. MCP and LLM-oriented documentation
Tradier is unusually forward-leaning in how it presents its docs for the LLM era. Its documentation includes llms.txt, dedicated LLM resources, and a Tradier MCP section. Tradier’s own MCP documentation says users can access market data, account details, documentation, and even place trades from within connected AI tools. That makes Tradier one of the few providers publicly bridging financial APIs and conversational interfaces in a first-class way.

2. Strong fit for execution-capable agents
Many financial APIs stop at data retrieval. Tradier goes further by combining data access with brokerage actions such as order placement, account history, positions, and balances. In the AI agent era, that matters because the most interesting systems are often not just research agents but action-taking agents. Tradier is therefore especially relevant for developers building guarded execution workflows, trading copilots, or semi-autonomous assistants that need both read and act capabilities.
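
To illustrate what a guarded execution workflow might mean in practice, here is a minimal, vendor-neutral sketch. Nothing in it is Tradier's API: ProposedOrder, review_order, submit_if_safe, the allowlist, and the notional cap are all hypothetical names and values. The idea is simply that an order proposed by an agent passes deterministic checks before any brokerage call is made.

```python
from dataclasses import dataclass

# Hypothetical guardrail settings; real limits would come from risk policy.
ALLOWED_SYMBOLS = {"AAPL", "MSFT"}
MAX_NOTIONAL = 5000.0


@dataclass(frozen=True)
class ProposedOrder:
    symbol: str
    side: str          # "buy" or "sell"
    quantity: int
    limit_price: float


def review_order(order):
    """Return a list of rule violations; an empty list means the order passes."""
    problems = []
    if order.symbol not in ALLOWED_SYMBOLS:
        problems.append("symbol not on allowlist")
    if order.side not in {"buy", "sell"}:
        problems.append("unknown side")
    if order.quantity <= 0:
        problems.append("quantity must be positive")
    if order.quantity * order.limit_price > MAX_NOTIONAL:
        problems.append("notional exceeds cap")
    return problems


def submit_if_safe(order, send):
    """Call send(order) only if every check passes."""
    problems = review_order(order)
    if problems:
        return {"submitted": False, "problems": problems}
    return {"submitted": True, "broker_response": send(order)}


sent_orders = []


def fake_send(order):
    """Stand-in for a real brokerage call; it only records the order."""
    sent_orders.append(order)
    return "accepted"


good = submit_if_safe(ProposedOrder("AAPL", "buy", 10, 150.0), fake_send)
bad = submit_if_safe(ProposedOrder("TSLA", "buy", 100, 300.0), fake_send)
print(good)
print(bad)
```

Keeping the checks outside the model, in plain code, means the agent can propose anything but can only act within limits a human has set.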

3. Streaming interfaces for real-time agent loops
Tradier supports both HTTP and WebSocket streaming for market and account data. That is important for agent architectures that continuously monitor events, react to intraday changes, or trigger downstream workflows when market conditions shift. In practical terms, Tradier is better suited than batch-only APIs for event-driven agents that need live context rather than periodic polling alone.

In short

Tradier is one of the strongest options for AI agents that need to move beyond analysis into brokerage-connected workflows. It may not be the broadest general-purpose research API, but for U.S.-market, execution-aware agents, Tradier’s mix of market data, account endpoints, streaming support, and MCP/LLM resources makes it highly relevant.

Xignite

Overview

Xignite is an enterprise financial data platform centered on cloud-delivered APIs and market-data management. Its catalog covers stock quotes, ETFs and mutual funds, foreign exchange, futures and options, indices and benchmarks, fixed income and rates, company fundamentals, reference data, earnings, and news. The company also emphasizes broad upstream sourcing, stating that its data comes from more than 250 providers, alongside curated in-house datasets.

Xignite’s public positioning is less “developer hobbyist API” and more enterprise-grade market data infrastructure. It highlights unlimited-usage pricing, flexible commercial packaging by asset class, call frequency, and region, and delivery models that include real-time, historical, and reference data. Its developer materials also show a broad set of products for delayed quotes, real-time quotes, historical data, streaming, alerts, IPOs, and company information.

That means Xignite is best understood as a data platform for institutions and mature fintech products rather than as a lightweight API-first experimentation layer. For many teams, that is a feature, not a drawback. In an AI stack, the most valuable data provider is often the one that can reliably serve as the normalized source behind internal models, orchestration layers, and production analytics systems. This last point is an inference from Xignite’s product positioning toward scalable enterprise delivery and market-data management.

What makes it valuable in the AI Agent era?

1. Enterprise-grade breadth for multi-source agent pipelines
AI agents become more useful when they can combine quotes, fundamentals, benchmarks, reference data, and news into a single reasoning loop. Xignite’s catalog is strong on this dimension. Because it covers a wide range of asset classes and reference datasets, it can act as the structured data layer beneath enterprise financial copilots and internal analyst tools.

2. Strong fit for organizations building their own orchestration layer
Unlike Alpha Vantage or EODHD, Xignite’s public materials emphasize APIs, coverage, and market-data management rather than agent-specific packaging. In practice, that makes it attractive for organizations that want to build their own AI architecture on top of a robust enterprise data backbone instead of depending on vendor-supplied MCP experiences. That is an inference from Xignite’s public positioning around cloud APIs, data management, and unlimited-usage commercial structure.

3. Flexible delivery for production-scale systems
Xignite supports multiple delivery modes across real-time, delayed, historical, and streaming-style services, and it explicitly markets itself for demanding display applications, backtesting, alerts, and application integration. That flexibility matters in AI systems because not all components need the same data path: one model might need historical fundamentals, another might need event-driven market updates, and a third might need reference data normalization.

In short

Xignite is not the most visibly AI-marketed provider in this group, but it is a serious contender for enterprise AI finance stacks. If your goal is to build a proprietary agent platform on top of large-scale, normalized market-data services, Xignite’s breadth and infrastructure orientation make it more compelling than its relative lack of public AI branding might suggest.

EOD Historical Data

Overview

EOD Historical Data, now commonly presented as EODHD, offers a broad financial data platform spanning fundamentals, historical end-of-day prices, live and real-time feeds, intraday data, U.S. options, financial news, stock screeners, technical indicators, and exchange/reference datasets. On its homepage, the company positions itself as a “one-stop shop” for 30+ years of historical, fundamental, and real-time data across global markets, with coverage figures including 60 stock exchanges and 150,000 tickers.

One of EODHD’s strengths is that it sits between lightweight developer tools and more professional research infrastructure. It offers structured JSON and CSV responses, coding libraries, spreadsheet add-ons, and a broad menu of market datasets without being limited to only one narrow workflow. It also exposes precomputed technical indicators through API endpoints rather than requiring users to calculate everything from raw time series.
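
For a sense of what a precomputed indicator saves you, here is a minimal sketch of a simple moving average calculated by hand from a list of closing prices. The function name and the prices are invented for illustration, and this is a generic calculation, not EODHD's implementation.

```python
def simple_moving_average(closes, window):
    """Return the rolling mean of `closes`; positions without a full window are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, price in enumerate(closes):
        running_sum += price
        if i >= window:
            # Drop the price that just fell out of the window.
            running_sum -= closes[i - window]
        averages.append(running_sum / window if i >= window - 1 else None)
    return averages


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # made-up closing prices
sma3 = simple_moving_average(closes, 3)
print(sma3)  # [None, None, 11.0, 12.0, 13.0]
```

One moving average is easy. Maintaining dozens of indicators with consistent handling of gaps, splits, and partial windows is where an indicator endpoint starts to pay off, especially for an agent that should spend its effort on reasoning rather than on arithmetic.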

This combination makes EODHD particularly attractive for builders who want reasonably broad market-data coverage and analytics features in a format that remains accessible to smaller teams, solo developers, and applied AI prototypes.

What makes it valuable in the AI Agent era?

1. Official MCP support for agent integration
EODHD provides an official MCP server for financial data and explicitly documents how to connect it to ChatGPT, Claude, and custom AI agents. The company describes this as a way for AI agents and LLMs to access real-time and historical financial data directly through MCP, making EODHD one of the clearest AI-era data providers alongside Alpha Vantage and Tradier.

2. An official ChatGPT-oriented financial assistant
Beyond MCP, EODHD also offers an official Financial Assistant for ChatGPT, which it describes as an AI that can generate code for EODHD APIs and provide finance insights grounded in real data and news. That does not just signal marketing interest in AI; it suggests the company is actively shaping its product and developer experience around LLM-driven usage patterns.

3. Strong structured outputs plus higher-level analytics
EODHD’s AI relevance is also practical. It provides structured JSON/CSV outputs, extensive API documentation, libraries, and technical-indicator endpoints that already package financial signals into machine-usable form. For agentic systems, that reduces the burden of transforming raw market data before it can be used in screening, summarization, ranking, or recommendation workflows.

In short

EODHD is one of the strongest all-around options for the AI agent era. It combines broad market coverage with precomputed indicators, developer-friendly structured data, an official MCP server, and a ChatGPT-oriented assistant. For teams that want something more AI-forward than classic enterprise vendors but broader than a narrow single-purpose API, EODHD is a very strong choice.

QuoteMedia

Overview

QuoteMedia is a long-standing market-data provider focused on real-time and historical data, news, analytics, and financial information for brokerages, websites, trading systems, and investor-facing products. Its Request APIs and OnDemand services are built around cloud-based access to market data, while its streaming products emphasize tick-by-tick delivery, low latency, and enterprise-grade reliability. QuoteMedia also highlights broad operational scale, including 110+ global exchanges, 200+ data APIs, 99.99% uptime, and 100+ news providers.

A notable strength of QuoteMedia is delivery flexibility. Its platform spans REST-style OnDemand APIs, WebSocket and other streaming interfaces, and SFTP-based file services for bulk delivery. It also supports JSON, XML, CSV, option-chain data, company profiles, historical time series, filings, and custom calculations. That makes QuoteMedia less of a single API product and more of a market-data delivery platform.

QuoteMedia’s public positioning is similar to Xignite in one important way: it is more infrastructure-oriented than explicitly LLM-oriented. In other words, its clearest strengths are reliability, breadth, delivery options, and integration into financial products, not public MCP or agent-marketing. That is an inference from the official materials reviewed.

What makes it valuable in the AI Agent era?

1. Low-latency data for real-time agent monitoring
QuoteMedia’s streaming stack is designed for real-time or delayed tick-by-tick data, normalized for ease of use and optimized for single-digit millisecond performance. For AI systems that monitor live markets, score signals, or trigger alerts and workflows off intraday movement, that kind of delivery profile is highly relevant.

2. Multiple delivery modes for different agent architectures
Modern AI finance stacks are not monolithic. Some components work best with REST requests, others with streams, and others with bulk files for offline training or evaluation. QuoteMedia supports cloud REST APIs, streaming APIs, and SFTP/file services, which makes it well suited to organizations building layered pipelines that combine real-time agent behavior with batch analytics and historical model development.

3. Strong fit as a production data layer
QuoteMedia offers market data, news, analytics, company profiles, option chains, filings, and historical data in structured formats such as JSON, XML, and CSV. That breadth makes it a useful foundation for internal copilots, research dashboards, summarization systems, and client-facing financial applications where the “AI” layer is built on top of the data platform rather than bundled by the vendor itself.

In short

QuoteMedia is a strong candidate for teams that care more about production-grade delivery and integration flexibility than about whether the vendor has already branded itself around AI agents. In the AI agent era, that still matters a lot: a reliable, low-latency, multi-format market-data backbone can be more valuable than flashy AI positioning if you are building your own orchestration layer.

Conclusion

If the goal is to find the most AI-ready providers, Alpha Vantage, Tradier, and EODHD stand out because they already offer MCP or LLM-oriented support. Alpha Vantage is particularly strong for AI-native research tools, Tradier is strong for brokerage-connected agents, and EODHD is a strong general-purpose choice.

If the goal is enterprise-grade infrastructure for proprietary AI systems, Xignite and QuoteMedia remain highly relevant. They may be less visibly AI-marketed, but they are strong as scalable market data backbones.

So in the AI agent era, the best stock market data API depends on what you are building. For AI-native financial research, Alpha Vantage has a strong edge. For execution-oriented agents, Tradier stands out. For broad AI-enabled workflows, EODHD is highly competitive. For enterprise infrastructure, Xignite and QuoteMedia are still important players.

7 SQL Use Cases Every Data Professional Should Know

Cornellius Yudha Wijaya — Sat, 07 Mar 2026 12:19:16 GMT

A lot of people learn SQL in a frustrating way.

They start with SELECT, FROM, WHERE, GROUP BY, maybe a few joins, and if they stay long enough, a window function or two. They can write queries. They can pass the exercises. But when they face a real business question, they still freeze.

That usually happens because they learned SQL as a list of clauses instead of a way to think.

In real work, SQL is rarely about showing that you remember syntax. It is about recognizing what kind of problem a business question becomes once it hits the data:

Is this a reporting problem?
A funnel problem?
A cohort problem?
A segmentation problem?
A QA problem?

The moment you can recognize that, SQL becomes much less intimidating and much more useful. That is the shift that matters.

The people who get genuinely strong at SQL are usually not the people who memorize the most functions. They are the people who can look at a business question and quickly understand what kind of data transformation it needs.

So instead of thinking about SQL as “a language I should know,” I think it is more useful to think about it as a toolkit for a handful of recurring jobs.

Here are seven of the most important ones. Let’s get into it.

1. KPI reporting

When teams want to know what is happening in the business, they usually start with some version of a KPI question. Revenue by month. Daily active users. Orders by country. Average order value. Churn rate by plan. Refund rate by product. These are not flashy questions, but they are the foundation of most reporting work.

This is where SQL starts becoming practical. You are not trying to prove how advanced you are. You are trying to turn raw data into something clear enough for another person to act on.

That means defining the metric carefully, filtering the right time window, grouping at the right level, and returning a result that is readable. The technical tools are simple, but the judgment behind them matters a lot.

A lot of people underestimate this kind of SQL because it feels too basic. I think that is a mistake. A team with weak KPI logic usually ends up with weak everything else.

A simple example is monthly revenue by product category:

SELECT
    DATE_TRUNC('month', order_date) AS order_month,
    product_category,
    SUM(revenue) AS total_revenue
FROM orders
WHERE order_date >= DATE '2026-01-01'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;

This is a basic grouped summary, but that is exactly why it matters. A lot of useful SQL is just good filtering, clean aggregation, and returning a table that another person can use.
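
If you want to run the same idea locally, here is a minimal sketch with Python's built-in sqlite3 module. The rows are invented, and because SQLite has no DATE_TRUNC, the month is derived with strftime instead; that substitution is my adaptation, not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_date TEXT, product_category TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2025-12-31", "games", 99.0),  # outside the reporting window
        ("2026-01-03", "books", 20.0),
        ("2026-01-15", "books", 30.0),
        ("2026-01-20", "games", 70.0),
        ("2026-02-02", "books", 10.0),
    ],
)

# strftime('%Y-%m', ...) stands in for DATE_TRUNC('month', ...) in SQLite.
monthly_revenue = conn.execute("""
    SELECT
        strftime('%Y-%m', order_date) AS order_month,
        product_category,
        SUM(revenue) AS total_revenue
    FROM orders
    WHERE order_date >= '2026-01-01'
    GROUP BY order_month, product_category
    ORDER BY order_month, total_revenue DESC
""").fetchall()

for row in monthly_revenue:
    print(row)
```

Storing dates as ISO-formatted text keeps string comparison and date order aligned in SQLite, which is why the plain `>=` filter works here.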

2. Funnel analysis

The second major use case is figuring out where people drop off.

This is where SQL starts feeling very close to product and growth work. A funnel question usually sounds like this: how many users started onboarding, how many completed profile setup, how many created their first project, and how many upgraded? In ecommerce, the same question shows up as view product, add to cart, begin checkout, and pay.

What makes funnel analysis valuable is that it shows where interest turns into friction.

A lot of the time, the problem is not “traffic is low.” The problem is that the path breaks at one specific step. SQL helps you see that step clearly. It lets you move from a vague sense that “conversion feels weak” to a more precise question like “why do so many users disappear between signup and first action?”

A simple event-based funnel might look like this:

SELECT
    step_name,
    COUNT(DISTINCT user_id) AS users_at_step
FROM onboarding_events
WHERE event_date >= DATE '2026-03-01'
GROUP BY 1
ORDER BY
    CASE step_name
        WHEN 'signup' THEN 1
        WHEN 'verify_email' THEN 2
        WHEN 'create_project' THEN 3
        WHEN 'first_active_use' THEN 4
    END;

This is not the most advanced funnel query in the world, but it already gives you a clearer conversation. Instead of saying “activation is weak,” you can ask, “Why do so many users disappear between verification and first project creation?”

Once you can answer that, the conversation gets much more useful.
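To turn the step counts into the drop-off question above, one option is to compute step-to-step conversion from the grouped result. The sketch below is a minimal Python illustration, assuming the counts have already been fetched as (step_name, users_at_step) pairs. The numbers are made up, and step_conversion is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

# Funnel order used in the query above.
FUNNEL_ORDER = ["signup", "verify_email", "create_project", "first_active_use"]


def step_conversion(counts: List[Tuple[str, int]]) -> List[Tuple[str, str, float]]:
    """Return (from_step, to_step, conversion_rate) for each adjacent pair of steps."""
    by_step = dict(counts)
    out = []
    for prev, curr in zip(FUNNEL_ORDER, FUNNEL_ORDER[1:]):
        prev_users = by_step.get(prev, 0)
        curr_users = by_step.get(curr, 0)
        # Avoid dividing by zero when a step has no users.
        rate = curr_users / prev_users if prev_users else 0.0
        out.append((prev, curr, round(rate, 3)))
    return out


# Illustrative counts only (not real data).
example_counts = [
    ("signup", 1000),
    ("verify_email", 800),
    ("create_project", 300),
    ("first_active_use", 240),
]

for from_step, to_step, rate in step_conversion(example_counts):
    print(f"{from_step} -> {to_step}: {rate:.1%}")
```

In this made-up data, the weakest transition is verify_email to create_project. Note that the sketch treats each step count independently; a stricter funnel would also require that a user reached the previous step first.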

3. Cohort retention analysis

This is one of the most important SQL use cases because it forces better thinking.

A cohort retention analysis groups users by a shared starting point, then checks whether they come back in later periods. That sounds simple, but it is one of those areas where small definition choices change the whole story. What puts a user into a cohort? What counts as a return? What does a week mean? Should a user count once per week or every time they generate an event?

That is why good retention work is not mainly about writing SQL. It is about locking the logic before the SQL ever begins.

This is also where SQL becomes more than a reporting language. It becomes a way of expressing lifecycle behavior. Once you can build a trustworthy retention table, you can stop asking “are users coming back?” in a vague way and start asking “which users are sticking, when do they drop, and what changed across cohorts?”

That is one of the reasons I like this use case so much. It pushes people past syntax into actual analytical design.

A very small example of the logic looks like this:

WITH user_cohort AS (
    SELECT
        user_id,
        DATE_TRUNC('week', MIN(login_date)) AS cohort_week
    FROM logins
    GROUP BY 1
),
user_activity AS (
    SELECT
        l.user_id,
        DATE_TRUNC('week', l.login_date) AS activity_week
    FROM logins l
    GROUP BY 1, 2
)
SELECT
    c.cohort_week,
    a.activity_week,
    COUNT(DISTINCT a.user_id) AS active_users
FROM user_cohort c
JOIN user_activity a
  ON c.user_id = a.user_id
GROUP BY 1, 2
ORDER BY 1, 2;

This is only the skeleton, not the full retention table. But even here, you can already see the shape: assign the cohort, map later activity, then aggregate by period.

For a deeper dive into this use case, see the Cohort Retention in SQL post.

4. Segmentation

Once you know the overall number, the next question is almost always: who exactly is driving it?

That is segmentation.

Averages are useful, but they hide a lot. SQL becomes much more powerful once you stop treating all users as one group and start cutting the data into meaningful slices. That might mean country, plan, acquisition channel, device type, power users versus casual users, or first purchase month.

And in practice, this is where a lot of strong SQL users separate themselves. They stop producing one big average and start showing where the business behaves differently across groups.

A simple segmentation example might be conversion rate by acquisition channel:

SELECT
    acquisition_channel,
    COUNT(DISTINCT user_id) AS users,
    SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) AS converted_users,
    ROUND(
        1.0 * SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END)
        / COUNT(DISTINCT user_id),
        3
    ) AS conversion_rate
FROM user_conversion_summary
GROUP BY 1
ORDER BY conversion_rate DESC;

This is where SQL starts feeling strategic. You stop asking, “Is conversion improving?” and start asking, “Is conversion improving for the users we actually care about?”

5. Experiment analysis

If you work near product or growth teams, SQL becomes very important the moment experiments show up.

Before anyone talks about significance, lift, or confidence intervals, someone still has to build the dataset properly. Who was in the control group? Who was in the treatment group? Who converted? Over what window? Were there logging issues? Did the assignment logic work as expected?

A lot of that early work is SQL.

And this matters more than people think, because if the experiment table is wrong, everything that comes after it is already compromised. If the assignment table is joined incorrectly, if the outcome window is inconsistent, or if duplicated rows quietly inflate conversions, the eventual statistical discussion becomes much less meaningful.

So even though experiment analysis sounds advanced, a lot of it still comes down to careful SQL habits and clean dataset construction.

A simple experiment summary might look like this:

SELECT
    variant,
    COUNT(DISTINCT user_id) AS users,
    SUM(CASE WHEN purchased = 1 THEN 1 ELSE 0 END) AS purchasers,
    ROUND(
        1.0 * SUM(CASE WHEN purchased = 1 THEN 1 ELSE 0 END)
        / COUNT(DISTINCT user_id),
        3
    ) AS purchase_rate
FROM experiment_user_summary
WHERE experiment_name = 'checkout_redesign_v1'
GROUP BY 1
ORDER BY 1;

That is not the full experiment analysis, but it is the foundation.
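Before trusting a summary like this, it can help to check the assignment logic itself. The sketch below is one possible sanity check, run against an in-memory SQLite database from Python: it looks for users who appear in more than one variant of the same experiment. The experiment_assignments table, its columns, and the rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment_assignments (experiment_name TEXT, user_id INTEGER, variant TEXT)"
)
conn.executemany(
    "INSERT INTO experiment_assignments VALUES (?, ?, ?)",
    [
        ("checkout_redesign_v1", 1, "control"),
        ("checkout_redesign_v1", 2, "treatment"),
        ("checkout_redesign_v1", 3, "control"),
        ("checkout_redesign_v1", 3, "treatment"),  # user 3 was assigned twice
    ],
)

# Users who appear in more than one variant of the same experiment.
multi_variant_users = conn.execute(
    """
    SELECT user_id, COUNT(DISTINCT variant) AS variant_count
    FROM experiment_assignments
    WHERE experiment_name = ?
    GROUP BY user_id
    HAVING COUNT(DISTINCT variant) > 1
    ORDER BY user_id
    """,
    ("checkout_redesign_v1",),
).fetchall()

print(multi_variant_users)
```

An empty result is what you hope to see. Any rows returned here point to users whose outcomes would be counted in both groups.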

6. Data quality and QA checks

This is one of the least glamorous SQL use cases, and one of the most valuable.

A huge amount of trust in data work comes from catching bad structure early. Duplicate rows. Missing keys. Broken joins. Sudden changes in counts. Tables that stopped updating. Records that should be impossible but somehow exist anyway.

SQL is excellent for this kind of work because it is good at isolating patterns, comparing counts, checking coverage, and surfacing anomalies before they become reporting problems.

This is also one of the places where data professionals become more mature in practice. They stop using SQL only to answer the question they were asked, and they start using SQL to challenge whether the dataset itself deserves trust.

That is a very different mindset.

Once you develop it, your work usually becomes much more reliable.

For example, if you want to check for duplicate order IDs:

SELECT
    order_id,
    COUNT(*) AS row_count
FROM orders
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY row_count DESC;

This is basic, but incredibly useful.
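The same habit extends to missing keys and broken joins. As a hedged sketch, the anti-join below lists order rows whose customer_id has no match in a customers table. It runs against an in-memory SQLite database from Python, and the customers table, the customer_id column, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (100, 1), (101, 2), (102, 9), (103, NULL);
    """
)

# Orders with a missing or unmatched customer key.
orphans = conn.execute(
    """
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
    ORDER BY o.order_id
    """
).fetchall()

print(orphans)
```

Both the unmatched key (9) and the NULL key show up, which is usually what you want from a coverage check: either kind of row would silently drop out of an inner join.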

7. Operational monitoring

The last use case is the one that makes SQL feel closest to the day-to-day operating layer of a business.

Sometimes the question is not “what happened this quarter?” Sometimes the question is “did the pipeline run?”, “are transactions missing?”, “did yesterday’s volume collapse?”, or “did a critical table stop refreshing?”

At that point, SQL is not just helping with analysis. It is helping keep the system honest.

This kind of work often lives somewhere between analytics, operations, and data engineering. You are comparing expected versus actual counts, checking daily or weekly movement, and trying to spot problems before somebody else finds them in a broken dashboard or an angry meeting.

If you only think of SQL as a tool for reports, you miss how often it becomes part of the business’s operational nervous system.

A simple monitoring query might compare day-over-day order counts:

SELECT
    order_date,
    COUNT(*) AS orders_today,
    LAG(COUNT(*)) OVER (ORDER BY order_date) AS orders_yesterday
FROM orders
GROUP BY 1
ORDER BY 1;

This is where window functions become especially useful. They let you compare each row to related rows while keeping the row-level result visible, which is exactly the kind of thing you want for trend and monitoring work.
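To make the comparison actionable, one option is to flag days where volume falls sharply against the previous row. The sketch below runs a similar LAG query against an in-memory SQLite database from Python (window functions need SQLite 3.25 or newer). The 50% threshold and the sample rows are assumptions for illustration, not a recommended alerting rule.

```python
import sqlite3

DROP_THRESHOLD = 0.5  # assumed alert threshold: flag a drop of more than 50%

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
rows = (
    [(i, "2026-03-01") for i in range(100)]
    + [(100 + i, "2026-03-02") for i in range(90)]
    + [(200 + i, "2026-03-03") for i in range(30)]
)
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

daily = conn.execute(
    """
    WITH daily_counts AS (
        SELECT order_date, COUNT(*) AS orders_today
        FROM orders
        GROUP BY order_date
    )
    SELECT
        order_date,
        orders_today,
        LAG(orders_today) OVER (ORDER BY order_date) AS orders_previous
    FROM daily_counts
    ORDER BY order_date
    """
).fetchall()

# Keep only days whose count fell below the threshold share of the previous row.
alerts = [
    (day, today, prev)
    for day, today, prev in daily
    if prev is not None and today < prev * (1 - DROP_THRESHOLD)
]
print(alerts)
```

One caveat: LAG compares against the previous row that exists, not the previous calendar day. If a whole day is missing from the table, the comparison silently skips it, so a complete date spine is worth adding for real monitoring.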

The bigger point

If you look across all seven use cases, the pattern is pretty clear.

SQL is rarely valuable because of its isolated syntax.

It is valuable because the same small set of ideas keeps getting reused across real work.

That is why strong SQL users usually do not sound like they are reciting functions. They sound like they understand data shape.

That is a much better goal than “learn more SQL syntax.”

Where to go next

If you are still early, I would not try to learn every advanced clause in one sitting.

I would focus on connecting SQL to actual problems.

That is exactly why I built the SQL track into the NBD Focus Map. The point is not to learn SQL randomly. The point is to see how the pieces fit together and start shipping small, useful work with them.

Start here

If you want the broader path, start with the Focus Map:

Focus Map

If you want the full paid system, use:

Vault: https://www.nb-data.com/p/nbd-reading-vault-paid-guided-paths
Template Index: https://www.nb-data.com/p/template-pack-index-paid
Subscriber Benefits: https://www.nb-data.com/p/subscriber-benefits

Cohort Retention in SQL

Cornellius Yudha Wijaya — Fri, 06 Mar 2026 18:39:19 GMT

Most retention tables are not wrong because the SQL is complicated.

They are wrong because the definitions are loose.

Someone says, “Let’s look at retention,” a query gets written, a heatmap shows up in a dashboard, and suddenly everyone is talking about Week 1 and Month 1 as if those numbers are objective facts. They usually are not. They are the result of choices. What counts as the start of a user’s journey? What counts as a return? What exactly is a week? What timezone are we using? Are we measuring one user once per period, or accidentally counting heavy users multiple times?

That is the real work in cohort retention. Not the division. Not the pivot table. The real work is deciding what story the table is allowed to tell.

At its core, cohort analysis is simple. You group users by a shared starting point, then measure what those users do in later periods. That is the common backbone behind most cohort SQL tutorials and warehouse implementations.

What makes it tricky is that small choices can change the story enough to change the decision.

So in this piece, I want to show you how I think about cohort retention in SQL when I want something that is not just presentable, but actually trustworthy. We will walk through a small sample dataset, turn it into a retention table step by step, and discuss the parts that often go wrong: cohort definition, return-event design, week boundaries, duplicate activity, partial cohorts, and interpretation.

Start with the question, not the query

Before touching SQL, I like to ask one uncomfortable question:

What exactly do I want this retention table to help me decide?

That question matters because different cohort definitions answer different business questions.

If I group users by the week they signed up, I am usually asking something about onboarding, activation, or acquisition quality. I want to know whether new users are sticking around after entering the funnel.

If I group users by the week they first did something meaningful, I am asking something slightly different. I am saying that signup is not the real beginning of value. Maybe the real beginning is the first login, the first purchase, the first report built, or the first document uploaded. In that case, I am less interested in the funnel entry and more interested in what happens once a user actually starts using the product.

Both are valid. But they are not interchangeable.

The same holds for the return event. If I define retention as “any page view,” my table might look reassuring while hiding the fact that users are not doing anything meaningful. If I define retention as “purchase,” the metric might be more valuable but also much sparser. There is no universally correct event, only events that are more or less aligned with the value loop you care about.

Then there is the time bucket. This is the part people often treat as neutral, even though it really isn’t. A daily retention table tells a different story than a weekly one. A weekly table tells a different story than a monthly one. And even the idea of a “week” is less fixed than people think. BigQuery, for example, distinguishes between WEEK (weeks starting on Sunday), WEEK(<WEEKDAY>) (weeks starting on a day you choose), and ISOWEEK (ISO 8601 weeks starting on Monday), and those choices affect how dates are grouped and how period differences are calculated.
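To see how much the week convention matters, the small Python sketch below buckets two adjacent dates under a Monday-start and a Sunday-start week. The helper names are hypothetical and only illustrate the idea; they are not tied to any warehouse's functions.

```python
from datetime import date, timedelta


def week_start_monday(d: date) -> date:
    """Start of the Monday-based week containing d."""
    return d - timedelta(days=d.weekday())  # Monday is weekday 0


def week_start_sunday(d: date) -> date:
    """Start of the Sunday-based week containing d."""
    return d - timedelta(days=(d.weekday() + 1) % 7)  # Sunday maps to offset 0


sunday = date(2026, 3, 1)
monday = date(2026, 3, 2)

for d in (sunday, monday):
    print(d, "monday-start:", week_start_monday(d), "sunday-start:", week_start_sunday(d))
```

Under the Sunday-start convention both dates land in the same week. Under the Monday-start convention they fall into different weeks, so a user who signs up on Sunday and returns on Monday is either "same week" or "Week 1 retained" depending on a choice most people never look at.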

That is why I think of cohort retention as a design problem before I think of it as a SQL problem.

The version we’re building here

To make this concrete, let’s keep the example small and explicit.

In this walkthrough:

A user’s cohort is the week of their first login
Retention means they performed a login in a later week
The table uses calendar weeks
Each user should count at most once per week

That last condition matters a lot. If a user logs in ten times in the same week, they are still one retained user for that week. Retention is about whether someone came back in the period, not how noisy their event stream was.
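The four rules above can be sketched in a few lines of plain Python before any SQL is written. This is a minimal illustration under stated assumptions: the login rows are made up, weeks are assumed to start on Monday, and week_start is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date, timedelta

# Illustrative login events: (user_id, login_date). Not real data.
logins = [
    (1, date(2026, 3, 2)), (1, date(2026, 3, 3)), (1, date(2026, 3, 10)),
    (2, date(2026, 3, 4)), (2, date(2026, 3, 18)),
    (3, date(2026, 3, 9)), (3, date(2026, 3, 9)),
]


def week_start(d: date) -> date:
    """Monday of the calendar week containing d (an assumed week convention)."""
    return d - timedelta(days=d.weekday())


# Rule 1: a user's cohort is the week of their first login.
first_login = {}
for user_id, login_date in logins:
    if user_id not in first_login or login_date < first_login[user_id]:
        first_login[user_id] = login_date
cohort_week = {u: week_start(d) for u, d in first_login.items()}

# Rules 2-4: a set keeps each user at most once per (cohort, week offset).
users_by_cell = defaultdict(set)
for user_id, login_date in logins:
    offset = (week_start(login_date) - cohort_week[user_id]).days // 7
    users_by_cell[(cohort_week[user_id], offset)].add(user_id)

retention = {cell: len(users) for cell, users in sorted(users_by_cell.items())}
for (cohort, offset), active_users in retention.items():
    print(cohort, f"week {offset}:", active_users)
```

User 1 logs in twice during the first week and user 3 has a duplicated event, yet each counts once per week. That is the behavior the "at most once per week" rule is meant to guarantee.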

Sample data

Here is a tiny events table we can use end-to-end.

Template Pack Index (Paid)

Cornellius Yudha Wijaya — Sun, 01 Mar 2026 15:43:57 GMT

This is the index of all NBD Template Packs.

Templates are practical assets (docs/checklists/sheets) you can reuse to ship faster.

NBD Reading Vault (Paid): Guided Paths + Mini-Projects

Cornellius Yudha Wijaya — Tue, 24 Feb 2026 12:45:05 GMT

This Vault is the paid navigation layer of Non-Brand Data.

If you feel overwhelmed by the archive, use this page instead:

Pick one track (SQL / Python+ML / RAG)
Follow the reading order
Ship one mini-project at the end

✨Subscriber Benefits

Cornellius Yudha Wijaya — Sun, 22 Feb 2026 13:26:51 GMT

Non-Brand Data Subscriber Benefits

This is the up-to-date summary of what you get as a free reader, paid member, or founding member. I keep this post updated when something changes.

Full version

Free subscribers

Start here: NBD Focus Map (Free PDF)

What you get:

Focus Map plus the best posts in order
Public posts and the public archive
Subscriber chat and comment threads

Paid members

What you get:

Member-only deep-dive posts plus full archive
Monthly template pack
Vault reading list page

Founding members

What you get:

Everything in Paid
Priority feedback plus one annual review call

The annual review call is for one project or portfolio entry. We focus on what to change so it is hiring-manager-ready.

One-time purchases (optional)

These are separate from subscriptions. Buy once and reuse anytime.

Portfolio Rubric Toolkit
Data Science Resume Template (FREE)
Python Packages to Learn Data Science (e-book) (FREE)

How to redeem your benefits

All subscribers

Focus Map
Start Here

Paid members

Access member-only posts by logging in with the email you used to subscribe
Template packs are delivered by email and will also be collected in one place as the vault grows
The member vault reading list will be added here once it is published

Founding members

Reply to any email with the subject: Founding review
Include:

a link to your project/repo/write-up
what you want feedback on

I will send the booking link, and we will schedule the 30-minute call.

Note: If you ever feel lost in the archive, do not scroll. Start with the Focus Map, or use the Vault reading list.

Creating a Daily Bulk Ingestion Pipeline for Historical Price Data and Fundamentals

Cornellius Yudha Wijaya — Thu, 19 Feb 2026 10:25:55 GMT

Photo by iridial on Unsplash

In the finance field, we are usually trying to answer two related questions at the same time:

What did the market do?
What did the business do?

Prices move every trading day, reflecting new information and expectations. Fundamentals, however, update more slowly and in batches because public companies report on a cycle (e.g., U.S. issuers file Form 10-Q after each of the first three fiscal quarters and an annual Form 10-K). This mismatch becomes a pain point in valuation and screening work: we need both datasets as of a specific point in time, but pulls made at different times can be inconsistent with each other.

This is why a daily ingestion pipeline exists. It gives us a consistent record that we reuse without re-downloading or questioning what we just pulled. Instead of relying on a live fetch each time, we can maintain a small local dataset that updates on schedule and is ready for further processing.

In this article, we will learn how to develop a daily bulk ingestion pipeline for historical price and fundamental data using source data from Financial Modeling Prep (FMP).

Curious about it? Let’s get into it.

Foundation

Before we move into the implementation details, it helps to treat this project as an ingestion layer built on top of an external data provider. Building this layer on top of the Financial Modeling Prep (FMP) API offers several practical benefits for financial analysis work.

First, it reduces duplication by reusing steps for requesting data, validating responses, standardising column names, and applying rules (e.g., date handling) for each symbol.

Second, it creates a single control point for the workflow, centralizing API key handling and daily logic rather than duplicating logic across scripts.

Third, it provides a stable historical record by maintaining a local dataset rather than recomputing results from live calls, thereby simplifying research and reporting.

Finally, it supports routine operation with two phases: an initial backfill to build historical coverage and a daily run to keep data current. Once scheduled, the dataset is automatically updated, ensuring a reliable workflow.

The Data Source

Let’s start building our daily ingestion pipeline by deciding which datasets we will pull from FMP. In this project, all data comes from FMP’s Stable API, which uses a single base URL and a consistent URL pattern:

https://financialmodelingprep.com/stable/

In practice, FMP provides many endpoints, but this pipeline intentionally uses only a small subset. The goal is to identify the minimum datasets required to build a reliable store of historical prices and core financial statements, without introducing optional datasets that complicate maintenance.

For this pipeline, we rely on these endpoints:

Company search (search-symbol): Lets you search by company name or partial ticker and returns candidates with symbols, names, exchanges, and currencies.
Company profile (profile): Returns the baseline company metadata you typically want to store alongside your price and fundamentals tables.
Income statement (income-statement): Provides revenue, net income, and other income statement fields over time.
Balance sheet statement (balance-sheet-statement): Provides assets, liabilities, and equity fields that help you understand the company’s financial position.
Cash flow statement (cash-flow-statement): Provides operating, investing, and financing cash flow fields, which are essential for evaluating cash generation and sustainability.
Historical end-of-day prices (historical-price-eod/full): Provides daily OHLCV and related fields for historical price storage.

These datasets are sufficient to build a clean ingestion pipeline that stores daily prices by date and financial statements by reporting period, while keeping the system simple and easy to run every day.

Project structure

This project is intentionally organised to separate the application, data storage, and entry points.

A simplified view of the project looks like this:

fmp_daily_ingestion/
├─ .github/
│  └─ workflows/
│     └─ daily_ingestion.yml
├─ app/
│  ├─ __init__.py
│  ├─ db.py
│  ├─ fmp_client.py
│  ├─ pipeline.py
│  └─ settings.py
├─ data/
│  ├─ fmp.sqlite3
│  └─ scheduler.log
├─ scripts/
│  ├─ __init__.py
│  ├─ backfill_symbols.py
│  ├─ backfill_prices.py
│  ├─ run_daily.py
│  ├─ scheduler.py
│  └─ check_db.py
├─ .env
└─ requirements.txt

With the project foundations in place, we can build the daily ingestion pipeline.

Step-by-Step Walkthrough

In this section, we build the daily ingestion pipeline one step at a time.

Step 1: Define dependencies and configuration

First, we set up the requirements.txt file, keeping the dependencies minimal.

requests
python-dotenv
pandas
schedule

We also define our .env file, which supplies runtime configuration without hardcoding secrets or machine-specific paths into the code.

FMP_API_KEY=YOUR_KEY
FMP_STABLE_BASE_URL=https://financialmodelingprep.com/stable
DB_PATH=data/fmp.sqlite3

FMP_WATCHLIST=AAPL,MSFT,TSLA
FUNDAMENTALS_PERIODS_TO_REFRESH=4

REQUEST_TIMEOUT=30
REQUEST_SLEEP=0.15

FMP’s Stable API uses a single base URL and authentication through an API key passed as a query parameter.
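As a rough illustration of that URL pattern, the sketch below assembles a request URL with only the Python standard library. build_stable_url is a hypothetical helper written for this example, and YOUR_KEY is a placeholder rather than a working key.

```python
from urllib.parse import urlencode

STABLE_BASE_URL = "https://financialmodelingprep.com/stable"


def build_stable_url(endpoint: str, api_key: str, **params: str) -> str:
    """Join the base URL and endpoint, then append the params plus the apikey."""
    query = dict(params)
    query["apikey"] = api_key
    return f"{STABLE_BASE_URL}/{endpoint.lstrip('/')}?{urlencode(query)}"


print(build_stable_url("profile", "YOUR_KEY", symbol="AAPL"))
```

Because the key travels in the query string, it is worth keeping full URLs out of logs and error messages when running this for real.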

Step 2: Establish a single configuration contract

Next, we will create settings.py, which helps every script and module read the configuration consistently. It does the following:

load .env
validate required values (especially FMP_API_KEY)
provide defaults for optional settings

The implementation looks like this:

# app/settings.py
import os
from dotenv import load_dotenv

# Load .env file explicitly
load_dotenv()

FMP_API_KEY = os.getenv("FMP_API_KEY")
if not FMP_API_KEY:
    raise RuntimeError("Missing FMP_API_KEY. Set it as an environment variable or in .env file.")

# Stable is the base URL the client uses. The V3 URL is kept only as an
# optional legacy setting; the Stable-only client does not read it.
FMP_STABLE_BASE_URL = os.getenv("FMP_STABLE_BASE_URL", "https://financialmodelingprep.com/stable").rstrip("/")
FMP_V3_BASE_URL = os.getenv("FMP_V3_BASE_URL", "https://financialmodelingprep.com/api/v3").rstrip("/")

WATCHLIST = [s.strip().upper() for s in os.getenv("FMP_WATCHLIST", "AAPL,MSFT,TSLA").split(",") if s.strip()]

DB_PATH = os.getenv("DB_PATH", "data/fmp.sqlite3")

# Daily fundamentals: fetch last N rows and upsert (simple + idempotent).
FUNDAMENTALS_PERIODS_TO_REFRESH = int(os.getenv("FUNDAMENTALS_PERIODS_TO_REFRESH", "4"))

REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))
REQUEST_SLEEP = float(os.getenv("REQUEST_SLEEP", "0.15"))

This becomes the project’s control plane: whether you run the project locally, in GitHub Actions, or under a scheduler, you change only environment values, not application code.

Step 3: Implement a Stable API client

In this section, we will build our client in fmp_client.py. The client should be the only module that knows how to:

build Stable URLs
attach apikey=...
enforce timeouts and basic pacing
raise clear errors when a request fails

The code looks like this:

from __future__ import annotations

import os
import time
from typing import Any, Dict, Optional

import requests
from urllib3.util import Retry
from requests.adapters import HTTPAdapter

from app.settings import FMP_API_KEY, FMP_STABLE_BASE_URL, REQUEST_TIMEOUT, REQUEST_SLEEP


class FMPClient:
    """
    Stable-only client (current docs):
      Base URL: https://financialmodelingprep.com/stable/
      Auth: apikey=

    Stable quickstart confirms base URL + apikey query auth.
    Historical EOD endpoint lives under Stable as well.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        stable_base_url: Optional[str] = None,
        v3_base_url: Optional[str] = None,  # accepted but unused by this Stable-only client
        timeout_s: Optional[int] = None,
        sleep_s: Optional[float] = None,
        session: Optional[requests.Session] = None,
    ) -> None:
        self.api_key = (api_key or FMP_API_KEY or "").strip()
        if not self.api_key:
            raise RuntimeError("Missing FMP_API_KEY. Set it in .env or environment variables.")

        self.base_url = (stable_base_url or FMP_STABLE_BASE_URL or "https://financialmodelingprep.com/stable").rstrip("/")
        self.timeout_s = int(timeout_s if timeout_s is not None else REQUEST_TIMEOUT)
        self.sleep_s = float(sleep_s if sleep_s is not None else REQUEST_SLEEP)

        self.session = session or requests.Session()
        if not session:
            # Configure retries
            retry_strategy = Retry(
                total=5,
                backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504],
                allowed_methods=["GET"],
                raise_on_status=True
            )
            adapter = HTTPAdapter(max_retries=retry_strategy)
            self.session.mount("https://", adapter)
            self.session.mount("http://", adapter)

    def _get_json(self, endpoint: str, params: Optional[Dict[str, Any]] = None) -> Any:
        params = dict(params or {})
        params["apikey"] = self.api_key

        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        resp = self.session.get(url, params=params, timeout=self.timeout_s)

        if resp.status_code == 402:
            raise RuntimeError(f"FMP 402 (Restricted Endpoint) for {url}: {resp.text[:300]}")

        if not resp.ok:
            raise RuntimeError(f"FMP error {resp.status_code} for {url}: {resp.text[:300]}")

        if self.sleep_s > 0:
            time.sleep(self.sleep_s)

        return resp.json()

    # Symbols
    def fetch_financial_statement_symbol_list(self) -> Any:
        """/stable/financial-statement-symbol-list"""
        return self._get_json("financial-statement-symbol-list")

    def fetch_profile(self, symbol: str) -> Any:
        """/stable/profile?symbol=AAPL"""
        return self._get_json("profile", {"symbol": symbol.upper()})

    # Prices (Stable)
    def fetch_historical_price_eod_full(
        self,
        symbol: str,
        date_from: Optional[str] = None,
        date_to: Optional[str] = None,
    ) -> Any:
        """
        Stable historical EOD (full):
          /historical-price-eod/full?symbol=AAPL
        """
        params: Dict[str, Any] = {"symbol": symbol.upper()}
        if date_from:
            params["from"] = date_from
        if date_to:
            params["to"] = date_to
        return self._get_json("historical-price-eod/full", params)

    # Fundamentals (Stable)
    def fetch_income_statement(self, symbol: str) -> Any:
        return self._get_json("income-statement", {"symbol": symbol.upper()})

    def fetch_balance_sheet(self, symbol: str) -> Any:
        return self._get_json("balance-sheet-statement", {"symbol": symbol.upper()})

    def fetch_cash_flow(self, symbol: str) -> Any:
        return self._get_json("cash-flow-statement", {"symbol": symbol.upper()})

These endpoints correspond directly to the Stable documentation for company profile, income statement, and historical EOD prices.

Step 4: Define the schema and write layer for data storage

In this section, we will define what we store and how we update it safely within the db.py file.

The code implementation will be as follows:

import sqlite3
import json
from datetime import datetime
from typing import Optional, Sequence, Tuple


DDL = """
CREATE TABLE IF NOT EXISTS symbols (
  symbol TEXT PRIMARY KEY,
  name TEXT,
  exchange TEXT,
  currency TEXT
);

CREATE TABLE IF NOT EXISTS prices_eod (
  symbol TEXT NOT NULL,
  date TEXT NOT NULL,
  open REAL,
  high REAL,
  low REAL,
  close REAL,
  volume REAL,
  PRIMARY KEY (symbol, date)
);

CREATE TABLE IF NOT EXISTS financials (
  symbol TEXT NOT NULL,
  period_end_date TEXT NOT NULL,
  statement_type TEXT NOT NULL,
  year INTEGER,
  period TEXT,
  payload_json TEXT NOT NULL,
  PRIMARY KEY (symbol, period_end_date, statement_type)
);
"""


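def connect(db_path: str) -> sqlite3.Connection:
    import os
    # Create the parent directory only when the path has one;
    # os.makedirs("") raises if db_path is a bare filename.
    parent_dir = os.path.dirname(db_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.execute("PRAGMA synchronous=NORMAL;")
    return conn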


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(DDL)
    conn.commit()


def upsert_symbols(conn: sqlite3.Connection, rows: Sequence[Tuple]) -> None:
    conn.executemany(
        """
        INSERT INTO symbols (symbol, name, exchange, currency)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(symbol) DO UPDATE SET
            name=excluded.name,
            exchange=excluded.exchange,
            currency=excluded.currency
        """,
        rows,
    )
    conn.commit()


def upsert_prices(conn: sqlite3.Connection, rows: Sequence[Tuple]) -> None:
    conn.executemany(
        """
        INSERT INTO prices_eod (symbol, date, open, high, low, close, volume)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(symbol, date) DO UPDATE SET
            open=excluded.open,
            high=excluded.high,
            low=excluded.low,
            close=excluded.close,
            volume=excluded.volume
        """,
        rows,
    )
    conn.commit()


def upsert_financials(conn: sqlite3.Connection, rows: Sequence[Tuple]) -> None:
    conn.executemany(
        """
        INSERT INTO financials (symbol, period_end_date, statement_type, year, period, payload_json)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(symbol, period_end_date, statement_type) DO UPDATE SET
            year=excluded.year,
            period=excluded.period,
            payload_json=excluded.payload_json
        """,
        rows,
    )
    conn.commit()


def read_symbols(conn: sqlite3.Connection, limit: Optional[int] = None) -> list[str]:
    q = "SELECT symbol FROM symbols ORDER BY symbol"
    if limit:
        q += " LIMIT ?"
        cur = conn.execute(q, (limit,))
    else:
        cur = conn.execute(q)
    return [r[0] for r in cur.fetchall()]

The code above is designed as follows:

symbols is the reference table
prices_eod stores daily OHLCV keyed by (symbol, date)
financials stores statement rows keyed by (symbol, period_end_date, statement_type)

The purpose of this layer is not only persistence but also operational reliability. With primary keys and upserts in place, we can rerun backfills and daily jobs without creating duplicates.
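The rerun-safety claim can be checked with a tiny in-memory experiment. The sketch below uses a cut-down version of prices_eod (only symbol, date, and close) and made-up values. It is meant to illustrate the upsert behavior, not to replace the functions above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prices_eod (
      symbol TEXT NOT NULL,
      date TEXT NOT NULL,
      close REAL,
      PRIMARY KEY (symbol, date)
    )
    """
)

UPSERT = """
INSERT INTO prices_eod (symbol, date, close)
VALUES (?, ?, ?)
ON CONFLICT(symbol, date) DO UPDATE SET close=excluded.close
"""

# The first run inserts the row; the rerun with a revised close updates it in place.
conn.execute(UPSERT, ("AAPL", "2026-02-18", 100.0))
conn.execute(UPSERT, ("AAPL", "2026-02-18", 101.5))
conn.commit()

rows = conn.execute("SELECT symbol, date, close FROM prices_eod").fetchall()
print(rows)
```

The table still holds one row for the (symbol, date) key, carrying the latest value. That is the property that lets a backfill and a daily job overlap without producing duplicates.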

Step 5: Convert API responses into data rows

In this section, we will define pipeline.py, the script that holds the ingestion rules. The script should do the following:

normalize FMP response shapes
shape raw records into tuples that match table definitions
return those tuples so the DB layer can upsert them

The whole code implementation is as follows:

from __future__ import annotations

import json
import sqlite3
from typing import Any, Dict, Iterable, List, Optional, Tuple

from app.fmp_client import FMPClient
from app.db import upsert_symbols, upsert_prices, upsert_financials, read_symbols


def _as_list(payload: Any) -> List[Dict[str, Any]]:
    """
    Stable endpoints typically return a JSON array.
    This helper makes the pipeline robust if the response is wrapped.
    """
    if isinstance(payload, list):
        return [x for x in payload if isinstance(x, dict)]
    if isinstance(payload, dict):
        for key in ("data", "results", "historical"):
            v = payload.get(key)
            if isinstance(v, list):
                return [x for x in v if isinstance(x, dict)]
    return []


# 1) Symbols

def seed_symbols(conn: sqlite3.Connection, client: FMPClient, symbols: Optional[Iterable[str]] = None) -> int:
    """
    Seeds the symbols table. If symbols are provided, it enriches them via /profile.
    If none provided, it could fetch a global list (but free tier usually restricts this).
    Returns count of symbols processed.
    """
    if symbols:
        rows: List[Tuple] = []
        for s in symbols:
            sym = s.strip().upper()
            if not sym:
                continue
            prof = client.fetch_profile(sym)
            p = _as_list(prof)
            row = p[0] if p else {}
            
            name = row.get("companyName") or row.get("name")
            exchange = row.get("exchange") or row.get("exchangeShortName")
            currency = row.get("currency")
            
            rows.append((sym, name, exchange, currency))
        
        if rows:
            upsert_symbols(conn, rows)
        return len(rows)
    else:
        # Fallback to fetching a list if possible (Stable API allows financial-statement-symbol-list)
        payload = client.fetch_financial_statement_symbol_list()
        items = _as_list(payload)
        rows = []
        for r in items:
            sym = (r.get("symbol") or r.get("ticker") or "").strip().upper()
            if not sym:
                continue
            rows.append((
                sym,
                r.get("name") or r.get("companyName"),
                r.get("exchange") or r.get("exchangeShortName"),
                r.get("currency")
            ))
        if rows:
            upsert_symbols(conn, rows)
        return len(rows)

# 2) Prices

def backfill_prices_for_symbol(
    client: FMPClient,
    symbol: str,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
    timeseries: Optional[int] = None,  # Optional: keep only the most recent N bars
) -> List[Tuple]:
    """
    Returns rows for upsert_prices:
      (symbol, date, open, high, low, close, volume)
    """
    sym = symbol.strip().upper()
    payload = client.fetch_historical_price_eod_full(sym, date_from=date_from, date_to=date_to)
    bars = _as_list(payload)

    if timeseries:
        # Sort ascending by date first so the slice keeps the most recent N bars,
        # regardless of the order in which the API returns them.
        bars = sorted(bars, key=lambda b: str(b.get("date") or b.get("datetime") or b.get("time") or ""))
        bars = bars[-int(timeseries):]

    out: List[Tuple] = []
    for b in bars:
        dt = b.get("date") or b.get("datetime") or b.get("time")
        if not dt:
            continue
        out.append((
            sym,
            str(dt),
            b.get("open"),
            b.get("high"),
            b.get("low"),
            b.get("close"),
            b.get("volume")
        ))
    return out


def ingest_prices_for_date(
    conn: sqlite3.Connection,
    client: FMPClient,
    symbols: Iterable[str],
    target_date: str
) -> int:
    """
    Daily run: Fetch exactly one day per symbol and upsert.
    """
    total = 0
    for s in symbols:
        rows = backfill_prices_for_symbol(client, s, date_from=target_date, date_to=target_date)
        if rows:
            upsert_prices(conn, rows)
            total += len(rows)
    return total

# 3) Fundamentals
def refresh_fundamentals(
    conn: sqlite3.Connection,
    client: FMPClient,
    symbols: Iterable[str],
    last_n: int = 4
) -> int:
    """
    Refreshes the latest N financial statements for a watchlist.
    """
    total = 0
    for s in symbols:
        sym = s.strip().upper()
        bundles = [
            ("income_statement", client.fetch_income_statement(sym)),
            ("balance_sheet", client.fetch_balance_sheet(sym)),
            ("cash_flow", client.fetch_cash_flow(sym)),
        ]

        rows_to_upsert = []
        for statement_type, payload in bundles:
            rows = _as_list(payload)
            for r in rows[: int(last_n)]:
                period_end = r.get("date")
                if not period_end:
                    continue

                year = r.get("calendarYear") or r.get("year")
                period = r.get("period")
                
                rows_to_upsert.append((
                    sym,
                    str(period_end),
                    statement_type,
                    year,
                    period,
                    json.dumps(r, ensure_ascii=False)
                ))
        
        if rows_to_upsert:
            upsert_financials(conn, rows_to_upsert)
            total += len(rows_to_upsert)
            
    return total

Taken together, the pipeline functions cover the project lifecycle:

Symbols seeding enriches a watchlist using the profile endpoint and creates rows for symbols. The profile endpoint is documented with symbol as a required query parameter.
Price backfill fetches historical EOD bars, maps each bar to (symbol, date, open, high, low, close, volume), then returns rows to be upserted into prices_eod.
Daily ingestion uses the same shaping rules but narrows the request to a single target date (typically yesterday), ensuring the daily mode is not a separate system but a constrained version of the same ingestion path.
Fundamentals refresh fetches the latest statement rows and stores them under a composite key.

The central principle is consistency: every dataset acquired from the FMP API goes through the same normalize, shape, and upsert path.
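To make the composite-key idea concrete, here is a minimal sketch of an idempotent upsert in SQLite. The table name financials_demo, its columns, and the sample values are assumptions made up for illustration; the project's real schema and upsert_financials live in app/db.py and may differ.

```python
import json
import sqlite3

# Hypothetical table for illustration only; the real schema is defined in app/db.py.
DDL = """
CREATE TABLE IF NOT EXISTS financials_demo (
    symbol TEXT NOT NULL,
    period_end TEXT NOT NULL,
    statement_type TEXT NOT NULL,
    payload TEXT NOT NULL,
    PRIMARY KEY (symbol, period_end, statement_type)
)
"""

# ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer.
UPSERT = """
INSERT INTO financials_demo (symbol, period_end, statement_type, payload)
VALUES (?, ?, ?, ?)
ON CONFLICT(symbol, period_end, statement_type)
DO UPDATE SET payload = excluded.payload
"""


def upsert_demo(conn, rows):
    """Insert rows, overwriting the payload when the composite key already exists."""
    conn.executemany(UPSERT, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Made-up sample values: the same key arrives twice with different payloads.
row_v1 = ("AAPL", "2025-12-31", "income_statement", json.dumps({"revenue": 100}))
row_v2 = ("AAPL", "2025-12-31", "income_statement", json.dumps({"revenue": 120}))

upsert_demo(conn, [row_v1])
upsert_demo(conn, [row_v2])  # same key: updates the row instead of duplicating it

count = conn.execute("SELECT COUNT(*) FROM financials_demo").fetchone()[0]
stored = json.loads(conn.execute("SELECT payload FROM financials_demo").fetchone()[0])
print(count, stored)
```

Because the key is (symbol, period_end, statement_type), rerunning a backfill or a daily refresh rewrites existing rows rather than appending duplicates, which is what makes repeated runs safe.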

Step 6: Create runnable entry points

The scripts folder exists so we can run the pipeline without rewriting code each time. Each script should follow the same pattern:

import settings
connect and initialise DB
instantiate FMPClient
call pipeline functions
upsert results
print a concise summary

In this project, the scripts map directly to operational phases:

backfill_symbols.py seeds your symbols table from WATCHLIST:

import sys
from pathlib import Path

# Add project root to sys.path
sys.path.append(str(Path(__file__).parent.parent))

from app.settings import (
    DB_PATH, FMP_API_KEY, FMP_STABLE_BASE_URL, FMP_V3_BASE_URL, WATCHLIST
)
from app.db import connect, init_db
from app.fmp_client import FMPClient
from app.pipeline import seed_symbols


def main():
    conn = connect(DB_PATH)
    init_db(conn)

    client = FMPClient(
        api_key=FMP_API_KEY,
        stable_base_url=FMP_STABLE_BASE_URL,
        v3_base_url=FMP_V3_BASE_URL,
    )

    n = seed_symbols(conn, client, WATCHLIST)
    print(f"Seeded {n} symbols into DB ({DB_PATH}) from WATCHLIST.")


if __name__ == "__main__":
    main()

backfill_prices.py performs historical loading for prices_eod

import sys
from pathlib import Path

# Add project root to sys.path
sys.path.append(str(Path(__file__).parent.parent))

import argparse

from app.settings import (
    DB_PATH, FMP_API_KEY, FMP_STABLE_BASE_URL, FMP_V3_BASE_URL, WATCHLIST
)
from app.db import connect, init_db, read_symbols, upsert_prices
from app.fmp_client import FMPClient
from app.pipeline import backfill_prices_for_symbol


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--limit", type=int, default=None, help="Backfill only first N symbols from DB")
    ap.add_argument("--symbols", type=str, default=None, help="Comma-separated tickers (overrides WATCHLIST)")

    # Optional: limit how much history you pull
    ap.add_argument("--from-date", type=str, default=None, help="YYYY-MM-DD")
    ap.add_argument("--to-date", type=str, default=None, help="YYYY-MM-DD")
    ap.add_argument("--timeseries", type=int, default=None, help="Return last N days")

    args = ap.parse_args()

    conn = connect(DB_PATH)
    init_db(conn)

    if args.symbols:
        symbols = [s.strip().upper() for s in args.symbols.split(",") if s.strip()]
    else:
        # Defaults to watchlist if DB is empty or use current symbols
        db_syms = read_symbols(conn, limit=args.limit)
        symbols = db_syms if db_syms else WATCHLIST

    client = FMPClient(
        api_key=FMP_API_KEY,
        stable_base_url=FMP_STABLE_BASE_URL,
        v3_base_url=FMP_V3_BASE_URL,
    )

    total_rows = 0
    for i, sym in enumerate(symbols, 1):
        rows = backfill_prices_for_symbol(
            client,
            sym,
            date_from=args.from_date,
            date_to=args.to_date,
            timeseries=args.timeseries,
        )
        if rows:
            upsert_prices(conn, rows)
            total_rows += len(rows)

        if i % 25 == 0:
            print(f"Processed {i}/{len(symbols)} symbols...")

    print(f"Done. Upserted {total_rows} price rows.")


if __name__ == "__main__":
    main()

run_daily.py runs the daily refresh (yesterday’s prices + latest fundamentals)

import sys
from pathlib import Path

# Add project root to sys.path
sys.path.append(str(Path(__file__).parent.parent))

import datetime as dt

from app.settings import (
    DB_PATH, FMP_API_KEY, FMP_STABLE_BASE_URL, FMP_V3_BASE_URL, WATCHLIST, FUNDAMENTALS_PERIODS_TO_REFRESH
)
from app.db import connect, init_db
from app.fmp_client import FMPClient
from app.pipeline import ingest_prices_for_date, refresh_fundamentals


def main():
    # Target the previous calendar day (today minus one day)
    target_date = (dt.date.today() - dt.timedelta(days=1)).isoformat()

    conn = connect(DB_PATH)
    init_db(conn)

    client = FMPClient(
        api_key=FMP_API_KEY,
        stable_base_url=FMP_STABLE_BASE_URL,
        v3_base_url=FMP_V3_BASE_URL,
    )

    n_prices = ingest_prices_for_date(conn, client, WATCHLIST, target_date)
    n_fin = refresh_fundamentals(conn, client, WATCHLIST, last_n=FUNDAMENTALS_PERIODS_TO_REFRESH)

    print(f"[{target_date}] upserted {n_prices} price rows and {n_fin} fundamentals rows.")


if __name__ == "__main__":
    main()

scheduler.py runs run_daily.py on a local schedule and logs output

import sys
from pathlib import Path

# Add project root to sys.path
sys.path.append(str(Path(__file__).parent.parent))

import time
import schedule
import subprocess
import logging

# Configure logging
# Create the log directory first; FileHandler fails if data/ does not exist
Path("data").mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("data/scheduler.log"),
        logging.StreamHandler()
    ]
)

def run_job():
    logging.info("Starting daily ingestion job...")
    try:
        # Run run_daily.py as a subprocess (path is relative to the repository root)
        result = subprocess.run(
            [sys.executable, "scripts/run_daily.py"],
            capture_output=True,
            text=True,
            check=True
        )
        logging.info(f"Job completed successfully:\n{result.stdout}")
    except subprocess.CalledProcessError as e:
        logging.error(f"Job failed with error:\n{e.stderr}")
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")

def main():
    # Schedule the job for 01:00 AM every day
    # You can change this time as needed
    schedule.every().day.at("01:00").do(run_job)

    logging.info("Scheduler started. Ingestion job scheduled for 01:00 AM daily.")
    logging.info("Press Ctrl+C to exit.")

    try:
        while True:
            schedule.run_pending()
            time.sleep(60)  # Check every minute
    except KeyboardInterrupt:
        logging.info("Scheduler stopped by user.")

if __name__ == "__main__":
    main()

check_db.py verifies table counts, date ranges, and recent rows

import sys
from pathlib import Path

# Add project root to sys.path
sys.path.append(str(Path(__file__).parent.parent))

import sqlite3
import pandas as pd
from app.settings import DB_PATH

def main():
    print(f"Checking database at: {DB_PATH}")

    con = sqlite3.connect(DB_PATH)

    try:
        print("\n--- Row Counts ---")
        print(pd.read_sql("SELECT COUNT(*) AS n FROM symbols", con))
        print(pd.read_sql("SELECT COUNT(*) AS n FROM prices_eod", con))
        print(pd.read_sql("SELECT COUNT(*) AS n FROM financials", con))

        print("\n--- Price Statistics ---")
        print(pd.read_sql("SELECT MIN(date) AS min_date, MAX(date) AS max_date FROM prices_eod", con))

        print("\n--- Recent Prices (Last 5) ---")
        print(pd.read_sql("SELECT * FROM prices_eod ORDER BY date DESC LIMIT 5", con))

        print("\n--- Fundamentals Breakdown ---")
        print(pd.read_sql("SELECT statement_type, COUNT(*) AS n FROM financials GROUP BY statement_type", con))
    except Exception as e:
        print(f"Error checking DB: {e}")
    finally:
        con.close()

if __name__ == "__main__":
    main()

This separation keeps the project maintainable: each operational phase has its own entry point, so we can improve one part of the pipeline later without touching the others.

Step 7: The database generation

The data/ folder will contain the generated state:

fmp.sqlite3 (Our SQLite database)
scheduler.log (Our local scheduler audit trail, if you use it)

Nothing in data/ should be required for understanding the code. It is the product of running the pipeline.

Step 8: Scheduling (local or GitHub Actions)

We have two scheduling modes: one runs locally, and the other uses GitHub Actions.

Local scheduling (scripts/scheduler.py) triggers the daily job at a fixed time and writes logs to data/scheduler.log. It is the simplest option when you control the machine.
GitHub Actions scheduling (.github/workflows/daily_ingestion.yml) runs the same daily script on a cron schedule and stores the SQLite database as a workflow artifact. GitHub’s scheduled workflows are driven by cron syntax and operate in UTC. We can use the YAML file:

name: Daily Data Ingestion

on:
  schedule:
    # Runs at 02:00 UTC every day
    - cron: '0 2 * * *'
  workflow_dispatch:
    # Allows manual triggering

jobs:
  ingest:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Run daily ingestion
      env:
        FMP_API_KEY: ${{ secrets.FMP_API_KEY }}
        FMP_WATCHLIST: ${{ vars.FMP_WATCHLIST }}
        DB_PATH: data/fmp.sqlite3
      run: |
        python scripts/run_daily.py

    - name: Upload database
      uses: actions/upload-artifact@v4
      with:
        name: fmp-database
        path: data/fmp.sqlite3

The important part is that both modes execute the same run_daily.py entry point and therefore share the same ingestion behaviour.

That is all for the project structure that we built for the daily ingestion pipeline. In the next section, we will go through how to run them step-by-step.

Running the scripts

All operational entry points live in the scripts folder. Each script adds the project root to sys.path, so the recommended way to execute them is from the repository root using python scripts/<script_name>.py.

1. Install dependencies

From the project root:

pip install -r requirements.txt

This installs the minimal runtime stack (requests, python-dotenv, pandas, schedule).

2. Configure `.env`

Before running anything, ensure .env defines at least:

FMP_API_KEY (obtain the key from the FMP site)
FMP_WATCHLIST (comma-separated tickers)
DB_PATH (for example data/fmp.sqlite3)

Our scripts read these values through app/settings.py and use them consistently across the pipeline.

3. Seed symbols into the database

Run the following script:

python scripts/backfill_symbols.py

This script connects to the SQLite database at DB_PATH, initializes the schema, instantiates FMPClient, and seeds the symbols table using your WATCHLIST.

When it completes, it prints a confirmation of the number of symbols seeded and the database file used. Something like:

Seeded 3 symbols into DB (data/fmp.sqlite3) from WATCHLIST.

4. Backfill historical prices

Run the following script:

python scripts/backfill_prices.py

This script is the one-time historical loader for prices_eod. It also initializes the database schema before writing. The example result is shown below:

Done. Upserted 3768 price rows.

Symbol selection follows the rule: if you provide --symbols, it uses that list; otherwise, it reads from the database and falls back to WATCHLIST if the database is empty.
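The selection rule can be read as a small pure function. The sketch below uses a hypothetical helper name, resolve_symbols, that does not exist in the project; it only mirrors the branching inside backfill_prices.py so the precedence is easy to see and test.

```python
from typing import List, Optional, Sequence


def resolve_symbols(
    cli_symbols: Optional[str],
    db_symbols: Sequence[str],
    watchlist: Sequence[str],
) -> List[str]:
    """Pick the symbol list: explicit CLI value, else DB contents, else watchlist."""
    if cli_symbols:
        # --symbols wins: split on commas, trim, upper-case, drop empties
        return [s.strip().upper() for s in cli_symbols.split(",") if s.strip()]
    if db_symbols:
        return list(db_symbols)
    return list(watchlist)


from_cli = resolve_symbols("aapl, msft", ["TSLA"], ["NVDA"])
from_db = resolve_symbols(None, ["TSLA"], ["NVDA"])
from_watchlist = resolve_symbols(None, [], ["NVDA"])
print(from_cli, from_db, from_watchlist)
```

Keeping the precedence explicit matters during testing: passing --symbols never touches the rows already seeded in the database.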

You can keep the backfill controlled during testing by using the optional arguments defined in the script:

# Backfill only specific tickers
python scripts/backfill_prices.py --symbols AAPL,MSFT

# Backfill only first N symbols read from the DB
python scripts/backfill_prices.py --limit 10

# Limit history by date range
python scripts/backfill_prices.py --symbols AAPL --from-date 2024-01-01 --to-date 2024-12-31

# Limit history by “last N days” returned
python scripts/backfill_prices.py --symbols AAPL --timeseries 200

These flags correspond directly to the script’s argument parser (--limit, --symbols, --from-date, --to-date, --timeseries).

During execution, it prints progress every 25 symbols and ends with the total number of upserted price rows.

5. Run the daily ingestion job

Run the following script:

python scripts/run_daily.py

This is the daily operational entry point. It computes target_date as today minus one day, then performs two actions: it ingests prices for that date and refreshes fundamentals for the watchlist. The fundamentals refresh window is controlled by FUNDAMENTALS_PERIODS_TO_REFRESH.

For example, the result looks as follows:

[2026-02-13] upserted 3 price rows and 36 fundamentals rows.
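One caveat with the "today minus one day" rule: when yesterday falls on a weekend, there are no EOD bars to fetch, so the run simply upserts zero price rows. If you prefer the job to target the last weekday instead, a sketch like the one below could replace the date computation. The helper name previous_weekday is an assumption and is not part of the project, and it does not account for market holidays.

```python
import datetime as dt


def previous_weekday(today: dt.date) -> dt.date:
    """Return the most recent Monday-to-Friday date strictly before `today`."""
    candidate = today - dt.timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate -= dt.timedelta(days=1)
    return candidate


# 2026-02-16 is a Monday, so the previous weekday is Friday 2026-02-13.
monday_target = previous_weekday(dt.date(2026, 2, 16)).isoformat()
# On a Tuesday the result is the same as "today minus one day".
tuesday_target = previous_weekday(dt.date(2026, 2, 17)).isoformat()
print(monday_target, tuesday_target)
```

Because the price upsert is keyed by symbol and date, re-ingesting a date that is already stored is harmless.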

6. Verify what was stored in SQLite

Run the following script:

python scripts/check_db.py

This script is your verification tool. It prints row counts for symbols, prices_eod, and financials, shows min/max dates in prices_eod, prints the last five price rows, and summarizes fundamentals by statement_type.

The example result is as follows:

Checking database at: data/fmp.sqlite3

--- Row Counts ---
   n
0  3
      n
0  3777
    n
0  36

--- Price Statistics ---
     min_date    max_date
0  2021-02-10  2026-02-13

--- Recent Prices (Last 5) ---
  symbol        date    open    high     low   close      volume
0   AAPL  2026-02-13  262.01  262.23  255.45  255.78  54927132.0
1   MSFT  2026-02-13  404.45  405.54  398.05  401.32  33949805.0
2   TSLA  2026-02-13  414.31  424.06  410.88  417.44  50565054.0
3   AAPL  2026-02-12  275.59  275.72  260.18  261.73  81077229.0
4   MSFT  2026-02-12  405.00  406.20  398.01  401.84  40802400.0

--- Fundamentals Breakdown ---
     statement_type   n
0     balance_sheet  12
1         cash_flow  12
2  income_statement  12

This script is used for a quick check after backfills or the daily job.

7. Automate the daily run

First, let’s take a look at the local scheduler, which runs on our machine:

Run the following script:

python scripts/scheduler.py

This schedules the job daily at 01:00 AM and runs scripts/run_daily.py as a subprocess, writing logs to data/scheduler.log and to stdout.

GitHub Actions (hosted schedule)

The workflow runs at 02:00 UTC daily, sets FMP_API_KEY, FMP_WATCHLIST, and DB_PATH=data/fmp.sqlite3, then executes python scripts/run_daily.py and uploads the SQLite file as an artifact. The schedule only takes effect once the workflow file has been pushed to the default branch of the GitHub repository.

That’s all you need to understand how to build the daily ingestion pipeline with FMP.

Conclusion

In this article, we have learned how to build a small but reliable daily ingestion workflow that keeps two core financial datasets current: end-of-day prices and company fundamentals.

By relying on Financial Modeling Prep’s Stable API as the single upstream source, the pipeline remains consistent in how it authenticates, requests data, and standardizes responses, while remaining practical for routine use in research, screening, and internal analytics.

I hope it has helped!

How to Build an Earnings Briefing Engine Using the FMP API

Cornellius Yudha Wijaya — Fri, 13 Feb 2026 02:21:05 GMT

Photo by Jakub Żerdzicki on Unsplash

Earnings weeks are a compression problem. Many companies report within a short window, yet the preparation work is scattered across multiple sources.

The workflow is repetitive and time-sensitive, especially when you follow more than a few tickers. When you assemble earnings context one company at a time, you repeat the same steps for every symbol. Over time, briefings become inconsistent because each run follows a slightly different process.

This is where an earnings briefing engine becomes useful. It converts an ad hoc workflow into a repeatable pipeline, producing a consistent one-page brief for each ticker. It also makes the process easier to audit and extend over time.

In this article, we will build a minimal earnings briefing engine using Financial Modeling Prep’s stable endpoints.

Curious about it? Let’s get into it.

Foundation

An earnings briefing engine is a compact workflow that summarizes the key information you need before an earnings event. It does not attempt to forecast returns, and it does not replace deep research. Its purpose is operational: it creates a briefing document we can store and reuse.

The Earnings Briefing Engine that we will build comprises three things:

Fetch upcoming earnings events for a date window using the earnings calendar endpoint.
Build a standardized per-ticker bundle that captures event context, expectations, and recent financial performance.
Generate a consistent one-page briefing in a simple format.

The Data Source

This project uses Financial Modeling Prep (FMP) as the primary data source. FMP publishes an extensive catalog of financial datasets through its Stable API. The platform provides over 100 documented endpoints. It also offers additional delivery options, including WebSocket streaming and bulk downloads for selected datasets.

FMP Stable uses a simple base URL, and authentication is handled via an API key passed as a query parameter.

Base URL: https://financialmodelingprep.com/stable/
Auth: apikey=YOUR_API_KEY
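As a quick illustration of that pattern, the sketch below assembles a request URL without sending it. The helper name build_url and the DEMO_KEY value are placeholders introduced here, not part of FMP or of this project.

```python
from urllib.parse import urlencode

BASE_URL = "https://financialmodelingprep.com/stable"


def build_url(path: str, api_key: str, **params: str) -> str:
    """Join the base URL and path, then append query parameters plus apikey."""
    query = dict(params)
    query["apikey"] = api_key  # authentication travels as a query parameter
    return f"{BASE_URL}/{path.lstrip('/')}?{urlencode(query)}"


# "DEMO_KEY" is a placeholder, not a working key.
profile_url = build_url("profile", "DEMO_KEY", symbol="AAPL")
print(profile_url)
```

Since the key is part of the URL, it is worth redacting it before any URL reaches a log file or an error message.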

This briefing engine is built around a small set of Stable endpoints. Each endpoint maps to a section in the final one-page brief:

Earnings Calendar (earnings-calendar) provides upcoming and past earnings events. It includes the announcement date and EPS fields when available.
Analyst Estimates (analyst-estimates) provides forecasted revenue and EPS. This supports the market expectations section.
Company Profile (profile) provides a company snapshot such as sector, price, and market capitalization.
Income Statement (income-statement) provides historical statement rows for trend context.
Key Metrics (key-metrics) provides common KPIs used for compact metric blocks.

These are the data we will retrieve from the FMP API, and we will build the system based on it.

What the Earnings Briefing Engine Does

1. Pull upcoming earnings events for a date window
The engine queries the Stable Earnings Calendar endpoint with from and to. It returns upcoming announcements and may include EPS fields when available.
2. Extract symbols and de-duplicate
From the calendar response, the engine extracts symbol values. It keeps first occurrence order and removes duplicates.
3. Fetch a fixed dataset per symbol
For each ticker, the engine calls a small and explicit set of endpoints:
Company Profile for the snapshot context.
Analyst Estimates for revenue and EPS expectations.
Optional fundamentals endpoints, such as Income Statement and Key Metrics, when you want trend and KPI blocks.
4. Normalize responses into a stable ticker bundle
Each API response is mapped into a predictable internal schema. Missing datasets become empty objects or empty lists.
5. Render a one-page briefing from the bundle
A single renderer transforms the bundle into a consistent Markdown brief.
6. Save outputs to disk for reuse
Each ticker corresponds to a Markdown file in the output folder.
7. Repeat the workflow with different inputs
We can rerun the engine with a different date window or a watchlist.
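The normalization step is the one that keeps the renderer simple, so here is a rough sketch of the idea. The helper names (as_dict, as_rows, normalize_bundle) and the sample response are assumptions for illustration; the engine built later in this article uses its own functions.

```python
from typing import Any, Dict, List


def as_dict(value: Any) -> Dict[str, Any]:
    """First dict of a list response, or {} when the dataset is missing."""
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return value[0]
    return {}


def as_rows(value: Any) -> List[Dict[str, Any]]:
    """List of dict rows, or [] when the dataset is missing or malformed."""
    if isinstance(value, list):
        return [r for r in value if isinstance(r, dict)]
    return []


def normalize_bundle(symbol: str, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map raw responses into a predictable shape with no None values."""
    return {
        "symbol": symbol,
        "profile": as_dict(raw.get("profile")),
        "estimates": as_rows(raw.get("estimates")),
    }


# Made-up input: the profile is present, estimates are unavailable (None).
bundle = normalize_bundle("AAPL", {"profile": [{"sector": "Technology"}], "estimates": None})
print(bundle)
```

With this shape, downstream code can call bundle["profile"].get(...) or loop over bundle["estimates"] without checking for None first.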

Project Architecture

This project stays intentionally small. The goal is a single, clear pipeline with two entry points, rather than spreading behavior across many scripts.

earnings_briefing_engine/
├─ app/
│  ├─ __init__.py
│  ├─ config.py            # loads API key and stable base URL once
│  ├─ fmp_client.py        # HTTP wrapper, apikey injection, error handling
│  ├─ engine.py            # calendar or watchlist → bundle → briefing orchestration
│  └─ render_markdown.py   # one-page Markdown template renderer
├─ output/
│  └─ briefings/           # generated files, one per ticker
├─ .env                    # local configuration
├─ requirements.txt
├─ run.py                  # upcoming earnings window mode
├─ run_watchlist.py        # fixed watchlist mode
└─ output.txt              # optional run log or notes

Here are explanations for each of the scripts’ purposes:

app/config.py

Stores the single source of truth for configuration. This includes FMP_BASE_URL=https://financialmodelingprep.com/stable and your API key. FMP authenticates requests by appending apikey=... to each request.

app/fmp_client.py

A thin client that constructs URLs, attaches apikey, sets timeouts, and normalizes errors. This keeps API details out of the business logic. The calling pattern follows FMP’s Stable base URL and query authentication.

app/engine.py

The orchestration layer. It runs the numbered flow defined earlier:

In calendar mode, it calls earnings-calendar with from and to.
It extracts and de-duplicates symbols.
It fetches a fixed set of per-ticker datasets, then normalizes them into a stable bundle.
In watchlist mode, it can populate the event context using the per-company earnings endpoint (earnings).
It then calls the renderer and writes output files.

app/render_markdown.py

Converts the normalized bundle into a one-page briefing with consistent headings. Markdown is used because it is portable, diffable, and easy to store. You can add HTML or PDF later without changing the data pipeline.

output/briefings/

Holds the generated artifacts. A practical convention is one file per ticker per event date, for example AAPL_2026-02-06.md. This creates a durable record you can re-run and compare over time.
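A small sketch of that naming convention is shown below. The function name briefing_path is hypothetical and only illustrates how the file name could be derived from a symbol and an event date.

```python
from pathlib import Path


def briefing_path(out_dir: str, symbol: str, event_date: str) -> Path:
    """Build <out_dir>/<SYMBOL>_<YYYY-MM-DD>.md for one ticker and event date."""
    return Path(out_dir) / f"{symbol.strip().upper()}_{event_date}.md"


path = briefing_path("output/briefings", "aapl", "2026-02-06")
print(path.as_posix())
```

Including the event date in the file name means a new earnings cycle produces a new file instead of overwriting the previous brief.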

Building the Earnings Briefing Engine

Let’s start to build our engine. We will break it down step-by-step.

Step 1: Create the environment

Start with a virtual environment and install only what you need by listing the dependencies in requirements.txt:

requests>=2.31.0
python-dotenv>=1.0.0

We will use requests for API calls and python-dotenv to load secrets from .env.

Step 2: Add a `.env` file for configuration

Create .env at the project root and store:

FMP_API_KEY=YOUR_API_KEY
FMP_BASE_URL=https://financialmodelingprep.com/stable

The Stable base URL is the canonical starting point for the endpoints used in this tutorial.

Step 3: Load settings once in `app/config.py`

Keep configuration in one place. The engine should not read environment variables inside business logic. It should receive a settings object.

The config.py will have the following code:

import os
from dataclasses import dataclass

from dotenv import load_dotenv

load_dotenv()

@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    out_dir: str = "output/briefings"


def get_settings() -> Settings:
    """
    This project targets FMP Stable endpoints:
      https://financialmodelingprep.com/stable/...
    """
    api_key = os.getenv("FMP_API_KEY", "").strip()
    base_url = os.getenv("FMP_BASE_URL", "").strip()

    if not api_key:
        raise RuntimeError("Missing FMP_API_KEY. Set it in your environment or in a .env file.")

    # Default to Stable API.
    if not base_url:
        base_url = "https://financialmodelingprep.com/stable"

    # Auto-correct common misconfiguration.
    if "/api/v3" in base_url:
        base_url = "https://financialmodelingprep.com/stable"

    return Settings(api_key=api_key, base_url=base_url)

In this step, we define:

api_key
base_url
out_dir

This aligns with the Stable API pattern and ensures consistent requests.

Step 4: Build a small client in `app/fmp_client.py`

Next, we will build our FMP Client using the following code:

from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Any, Dict, Optional

import requests


def _redact_apikey(url: str) -> str:
    # Cut the URL at the apikey value so the key never reaches logs or errors
    if "apikey=" not in url:
        return url
    return url.split("apikey=")[0] + "apikey=REDACTED"


@dataclass(frozen=True)
class FmpClient:
    """
    Minimal HTTP client for FMP Stable endpoints.
    """
    api_key: str
    base_url: str = "https://financialmodelingprep.com/stable"
    timeout_s: int = 30
    max_retries: int = 2  # only for 429

    def get_json(
        self,
        path: str,
        params: Optional[Dict[str, Any]] = None,
        *,
        allow_plan_errors: bool = True,
    ) -> Any:
        """
        If allow_plan_errors is True:
          - 402 (Payment Required) -> None
          - 403 (Forbidden) -> None
        """
        base = self.base_url.rstrip("/")
        url = f"{base}/{path.lstrip('/')}"

        q = dict(params or {})
        q["apikey"] = self.api_key

        attempts = 0
        while True:
            attempts += 1
            resp = requests.get(url, params=q, timeout=self.timeout_s)

            # Plan-restricted datasets are treated as "no data" rather than errors
            if allow_plan_errors and resp.status_code in (402, 403):
                return None

            # Rate limited: honor Retry-After when present, else back off briefly
            if resp.status_code == 429 and attempts <= self.max_retries:
                retry_after = resp.headers.get("Retry-After")
                wait_s = int(retry_after) if (retry_after and retry_after.isdigit()) else (1 + attempts)
                time.sleep(wait_s)
                continue

            if resp.status_code == 401:
                raise requests.HTTPError(f"Unauthorized (401) for {_redact_apikey(resp.url)}", response=resp)

            resp.raise_for_status()
            return resp.json()

Our client should do only four things:

Construct base_url + path
Attach apikey to query parameters
Set timeouts
Normalize common errors

FMP documents API key usage via query parameters, and also notes header-based auth as an alternative.
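One detail worth noting about the _redact_apikey helper above: it cuts the URL at apikey=, which is safe but also drops any parameter that follows the key. If you want error messages to keep the rest of the query string, an alternative sketch using only the standard library is shown below. The function name redact_apikey is an assumption and is not part of the client.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_apikey(url: str) -> str:
    """Replace the apikey value with REDACTED and keep every other parameter."""
    parts = urlsplit(url)
    query = [
        (k, "REDACTED" if k == "apikey" else v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# "SECRET" stands in for a real key.
redacted = redact_apikey(
    "https://financialmodelingprep.com/stable/profile?apikey=SECRET&symbol=AAPL"
)
print(redacted)
```

Either approach works for the 401 branch; the point is that the raw key should never appear in a raised exception or a log line.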

Step 5: Implement the full pipeline in `app/engine.py`

We will implement the whole earnings briefing flow in engine.py with the code below:

from __future__ import annotations

from datetime import date, timedelta
from pathlib import Path
from typing import Any, Dict, List, Optional

from app.fmp_client import FmpClient
from app.render_markdown import render_markdown


def _dedupe_keep_order(items: List[str]) -> List[str]:
    seen = set()
    out: List[str] = []
    for s in items:
        s = (s or "").strip().upper()
        if s and s not in seen:
            seen.add(s)
            out.append(s)
    return out

def _first_dict(x: Any) -> Dict[str, Any]:
    return x[0] if isinstance(x, list) and x and isinstance(x[0], dict) else {}


def fetch_earnings_calendar(client: FmpClient, start: date, end: date) -> List[Dict[str, Any]]:
    """
    Earnings Calendar (stable):
      GET /earnings-calendar?from=YYYY-MM-DD&to=YYYY-MM-DD
    Docs: https://financialmodelingprep.com/stable/earnings-calendar
    """
    data = client.get_json(
        "earnings-calendar",
        {"from": start.isoformat(), "to": end.isoformat()},
        allow_plan_errors=True,
    )
    return data or []


def fetch_profile(client: FmpClient, symbol: str) -> Dict[str, Any]:
    """
    Company Profile (stable):
      GET /profile?symbol=SYMBOL
    Docs: https://financialmodelingprep.com/stable/profile?symbol=AAPL
    """
    data = client.get_json("profile", {"symbol": symbol}, allow_plan_errors=True)
    return _first_dict(data)


# Optional (may be plan-limited depending on account)
def fetch_analyst_estimates(client: FmpClient, symbol: str, *, period: str = "quarter", limit: int = 8, page: int = 0) -> List[Dict[str, Any]]:
    data = client.get_json(
        "analyst-estimates",
        {"symbol": symbol, "period": period, "page": page, "limit": limit},
        allow_plan_errors=True,
    )
    return data or []


def fetch_income_statement(client: FmpClient, symbol: str, *, period: str = "quarter", limit: int = 8) -> List[Dict[str, Any]]:
    data = client.get_json(
        "income-statement",
        {"symbol": symbol, "period": period, "limit": limit},
        allow_plan_errors=True,
    )
    return data or []


def fetch_key_metrics(client: FmpClient, symbol: str, *, period: str = "quarter", limit: int = 8) -> List[Dict[str, Any]]:
    data = client.get_json(
        "key-metrics",
        {"symbol": symbol, "period": period, "limit": limit},
        allow_plan_errors=True,
    )
    return data or []


def fetch_stock_news(client: FmpClient, symbol: str, *, limit: int = 20) -> List[Dict[str, Any]]:
    data = client.get_json("news/stock", {"symbols": symbol, "limit": limit}, allow_plan_errors=True)
    return data or []


def fetch_press_releases(client: FmpClient, symbol: str, *, limit: int = 20) -> List[Dict[str, Any]]:
    data = client.get_json("news/press-releases", {"symbols": symbol, "limit": limit}, allow_plan_errors=True)
    return data or []


def build_bundle(
    client: FmpClient,
    symbol: str,
    *,
    event: Optional[Dict[str, Any]] = None,
    include_estimates: bool = False,
    include_financials: bool = False,
    include_news: bool = False,
    statements_period: str = "quarter",
    statements_limit: int = 8,
) -> Dict[str, Any]:
    profile = fetch_profile(client, symbol)

    estimates: List[Dict[str, Any]] = []
    income: List[Dict[str, Any]] = []
    key_metrics: List[Dict[str, Any]] = []
    news: List[Dict[str, Any]] = []
    press: List[Dict[str, Any]] = []

    if include_estimates:
        estimates = fetch_analyst_estimates(client, symbol, period=statements_period, limit=statements_limit)

    if include_financials:
        income = fetch_income_statement(client, symbol, period=statements_period, limit=statements_limit)
        key_metrics = fetch_key_metrics(client, symbol, period=statements_period, limit=statements_limit)

    if include_news:
        news = fetch_stock_news(client, symbol)
        press = fetch_press_releases(client, symbol)

    return {
        "symbol": symbol,
        "event": event or {},
        "profile": profile,
        "estimates": estimates,
        "income": income,
        "key_metrics": key_metrics,
        "news": news,
        "press": press,
    }

def run(
    settings: Any,
    *,
    days_ahead: int = 7,
    limit: int = 10,
    symbols: Optional[List[str]] = None,
    include_estimates: bool = False,
    include_financials: bool = False,
    include_news: bool = False,
) -> None:
    """
    Two modes:
      1) Calendar mode (default): pull upcoming earnings, then build briefs.
      2) Watchlist mode: pass symbols=[...].
    """
    client = FmpClient(api_key=settings.api_key, base_url=settings.base_url)

    out_dir = Path(getattr(settings, "out_dir", "output/briefings"))
    out_dir.mkdir(parents=True, exist_ok=True)

    events_by_symbol: Dict[str, Dict[str, Any]] = {}
    if symbols:
        target_symbols = _dedupe_keep_order(symbols)
    else:
        start = date.today()
        end = start + timedelta(days=days_ahead)
        events = fetch_earnings_calendar(client, start, end)

        for e in events:
            sym = (e.get("symbol") or "").strip().upper()
            if sym:
                events_by_symbol.setdefault(sym, e)

        target_symbols = list(events_by_symbol.keys())[:limit]

    if not target_symbols:
        print(
            "No symbols returned.\n"
            "Confirm your base URL is https://financialmodelingprep.com/stable and your API key is valid.\n"
            "If you are on the free tier, some datasets may be restricted."
        )
        return

    for i, sym in enumerate(target_symbols, start=1):
        bundle = build_bundle(
            client,
            sym,
            event=events_by_symbol.get(sym),
            include_estimates=include_estimates,
            include_financials=include_financials,
            include_news=include_news,
        )
        md = render_markdown(bundle)

        out_path = out_dir / f"{sym}.md"
        out_path.write_text(md, encoding="utf-8")
        print(f"[{i}/{len(target_symbols)}] wrote {out_path}")

This is where the engine becomes a repeatable workflow. The code above does the following:

Pull a calendar window. Call earnings-calendar with from and to. This yields upcoming and past earnings events, including EPS fields when available.
Extract symbols and de-duplicate. Read the symbol field from the calendar results. De-duplicate while preserving order. Apply a small limit so runs remain predictable.
Fetch a fixed set of datasets per ticker. Use the same calls for every symbol. Start with the essentials, then treat deeper fundamentals as optional: profile?symbol=... for sector and market-cap style snapshot fields; analyst-estimates?symbol=...&period=...&page=...&limit=... for revenue and EPS expectations; and, optionally, income-statement for trend context and key-metrics for compact KPI blocks.
Normalize into a stable bundle schema. Map responses into one predictable shape, then pass that shape downstream. Missing datasets are represented as {} or []. This keeps rendering stable even when some endpoints return no data on a given plan.
Write one artifact per ticker. For each bundle, call the renderer and save the Markdown into output/briefings/.
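The de-duplication step relies on the `_dedupe_keep_order` helper that `run()` calls. If you are assembling the file yourself and need a stand-in, a minimal version might look like the sketch below. The function name without the underscore and the upper-case normalization are my assumptions, chosen to mirror how calendar symbols are cleaned in `run()`.

```python
from typing import Iterable, List


def dedupe_keep_order(symbols: Iterable[str]) -> List[str]:
    """Drop blanks and repeats while keeping first-seen order."""
    seen = set()
    out: List[str] = []
    for raw in symbols:
        # Assumption: normalize the same way run() cleans calendar symbols.
        sym = (raw or "").strip().upper()
        if sym and sym not in seen:
            seen.add(sym)
            out.append(sym)
    return out


# Slicing after de-duplication keeps the batch size predictable.
batch = dedupe_keep_order(["aapl", "MSFT", " AAPL ", "", "nvda"])[:2]
print(batch)
```

Applying the limit after de-duplication, rather than before, means repeated calendar rows for the same ticker do not shrink the batch.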

If you also support watchlists, you can populate event context using the per-company earnings endpoint, then reuse the same bundle and rendering path.

Step 6: Render the one-page brief in `app/render_markdown.py`

Next, we set up render_markdown.py with the following code:

from __future__ import annotations

from datetime import datetime
from typing import Any, Dict, List, Optional

Json = Dict[str, Any]


def _first_dict(x: Any) -> Json:
    if isinstance(x, list) and x and isinstance(x[0], dict):
        return x[0]
    if isinstance(x, dict):
        return x
    return {}


def _as_list_of_dicts(x: Any) -> List[Json]:
    if isinstance(x, list):
        return [i for i in x if isinstance(i, dict)]
    return []


def _get_first_present(d: Json, keys: List[str], default: Any = "N/A") -> Any:
    for k in keys:
        v = d.get(k)
        if v is not None and v != "":
            return v
    return default


def _fmt_num(x: Any) -> str:
    if x is None:
        return "N/A"
    try:
        if isinstance(x, bool):
            return "N/A"
        if isinstance(x, (int, float)):
            if abs(x) >= 1_000_000_000:
                return f"{x/1_000_000_000:.2f}B"
            if abs(x) >= 1_000_000:
                return f"{x/1_000_000:.2f}M"
            if abs(x) >= 1_000:
                return f"{x:,.0f}"
            return f"{x:.4g}"
        xf = float(str(x).replace(",", ""))
        return _fmt_num(xf)
    except Exception:
        return str(x)

def render_markdown(bundle: Json) -> str:
    sym = bundle.get("symbol", "N/A")

    event = bundle.get("event") or {}
    profile = bundle.get("profile") or {}
    estimates = _as_list_of_dicts(bundle.get("estimates"))
    key_metrics = _as_list_of_dicts(bundle.get("key_metrics"))
    income = _as_list_of_dicts(bundle.get("income"))
    news = _as_list_of_dicts(bundle.get("news"))
    press = _as_list_of_dicts(bundle.get("press"))

    est0 = _first_dict(estimates)
    km0 = _first_dict(key_metrics)

    company = _get_first_present(profile, ["companyName", "name"], sym)
    sector = _get_first_present(profile, ["sector"], "N/A")

    # FIX: market cap key is commonly "marketCap" on profile payloads.
    mcap = _get_first_present(profile, ["marketCap", "mktCap", "marketCapitalization"], None)
    price = _get_first_present(profile, ["price"], None)

    event_date = _get_first_present(event, ["date", "earningDate"], "N/A")
    event_time = _get_first_present(event, ["time", "timeEstimated"], "N/A")

    # Prefer analyst estimates if present, otherwise fall back to the calendar row.
    eps_est = _get_first_present(est0, ["estimatedEps", "epsEstimated"], None)
    if eps_est in (None, "N/A"):
        eps_est = _get_first_present(event, ["epsEstimated", "estimatedEps"], None)

    rev_est = _get_first_present(est0, ["estimatedRevenue", "revenueEstimated"], None)
    if rev_est in (None, "N/A"):
        rev_est = _get_first_present(event, ["revenueEstimated", "estimatedRevenue"], None)

    lines: List[str] = []
    lines.append(f"# Earnings Briefing: {company} ({sym})")
    lines.append("")
    lines.append("## Event")
    lines.append(f"- Date: {event_date}")
    lines.append(f"- Time: {event_time}")
    lines.append("")
    lines.append("## Snapshot")
    lines.append(f"- Sector: {sector}")
    lines.append(f"- Price: {_fmt_num(price)}")
    lines.append(f"- Market cap: {_fmt_num(mcap)}")
    lines.append("")

    lines.append("## Expectations")
    lines.append(f"- Estimated EPS: {_fmt_num(eps_est)}")
    lines.append(f"- Estimated revenue: {_fmt_num(rev_est)}")
    lines.append("")

    # Only show these sections if you enabled them (or if your plan returns data).
    if km0:
        lines.append("## Key metrics (latest)")
        lines.append(f"- P/E: {_fmt_num(km0.get('peRatio'))}")
        lines.append(f"- Net margin: {_fmt_num(km0.get('netProfitMargin'))}")
        lines.append("")

    if income:
        lines.append("## Trend context")
        lines.append("- Financial statements were fetched (see JSON bundle for details).")
        lines.append("")

    if news or press:
        lines.append("## Recent context")
        if news:
            lines.append("- Stock news:")
            for n in news[:3]:
                title = _get_first_present(n, ["title"], None)
                pub = _get_first_present(n, ["publishedDate", "date"], None)
                if title:
                    lines.append(f"  - {title}" + (f" ({pub})" if pub else ""))
        if press:
            lines.append("- Press releases:")
            for p in press[:3]:
                title = _get_first_present(p, ["title"], None)
                pub = _get_first_present(p, ["date", "publishedDate"], None)
                if title:
                    lines.append(f"  - {title}" + (f" ({pub})" if pub else ""))
        lines.append("")

    lines.append("## Questions to listen for")
    lines.append("- What changed in demand, pricing, or volume versus last quarter?")
    lines.append("- What is driving margin movement?")
    lines.append("- What guidance signals matter most for the next two quarters?")
    lines.append("")
    lines.append(f"_Generated at {datetime.utcnow().strftime('%Y-%m-%d %H:%M UTC')}. Not financial advice._")
    return "\n".join(lines)

The renderer code above takes the bundle and produces a consistent Markdown page:

Event section uses the calendar row
Snapshot section uses profile fields
Expectations use analyst estimates, with optional fallback to calendar fields
Optional sections appear only when data exists

Step 7: Add entry points for two run modes

We keep the entry points thin:

run.py for “upcoming earnings” mode. It runs the earnings-calendar window and generates briefings for symbols in that window. We can tweak the code as below:

from app.config import get_settings
from app.engine import run

if __name__ == "__main__":
    settings = get_settings()
    run(settings, days_ahead=7, limit=10)

run_watchlist.py for “watchlist” mode. It runs the same bundle and renderer, but starts from a fixed list of symbols:

from app.config import get_settings
from app.engine import run

WATCHLIST = ["AAPL", "MSFT", "NVDA", "TSLA"]

if __name__ == "__main__":
    settings = get_settings()
    run(settings, symbols=WATCHLIST)

If you want watchlist mode to always show an earnings context, you can enrich it with the per-company earnings endpoint.
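One way to do that enrichment is sketched below. It assumes the per-company endpoint returns a list of rows with a `date` field in `YYYY-MM-DD` form, which is the same field the renderer already reads from calendar rows. The helper name `pick_next_event` and the `earnings` endpoint path in the comment are illustrative assumptions, so check them against FMP's documentation before relying on them.

```python
from datetime import date
from typing import Any, Dict, List, Optional


def pick_next_event(rows: List[Dict[str, Any]], today: date) -> Optional[Dict[str, Any]]:
    """Return the earliest row dated today or later, or None if there is none."""
    upcoming = []
    for row in rows:
        raw = str(row.get("date") or "")
        try:
            when = date.fromisoformat(raw[:10])
        except ValueError:
            continue  # skip rows with a missing or malformed date
        if when >= today:
            upcoming.append((when, row))
    if not upcoming:
        return None
    return min(upcoming, key=lambda pair: pair[0])[1]


# Hypothetical wiring inside run(), in the watchlist branch (endpoint name assumed):
#   rows = client.get_json("earnings", {"symbol": sym}, allow_plan_errors=True) or []
#   event = pick_next_event(rows, date.today())
#   if event:
#       events_by_symbol[sym] = event

# Invented sample rows, only to show the selection logic.
rows = [
    {"symbol": "AAPL", "date": "2025-10-30"},
    {"symbol": "AAPL", "date": "2026-01-29"},
    {"symbol": "AAPL", "date": "2026-04-30"},
    {"symbol": "AAPL", "date": None},
]
print(pick_next_event(rows, date(2026, 1, 5)))
```

Because the result has the same shape as a calendar row, it can be passed to `build_bundle` as `event` without touching the renderer.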

Step 8: Verify outputs and iterate safely

A successful run should produce one Markdown file per ticker under output/briefings/. For example, the result is shown below:

# Earnings Briefing: Shopify Inc. (SHOP)

## Event
- Date: 2026-02-11
- Time: N/A

## Snapshot
- Sector: Technology
- Price: 112.9
- Market cap: 147.38B

## Expectations
- Estimated EPS: 0.5
- Estimated revenue: 3.59B

## Questions to listen for
- What changed in demand, pricing, or volume versus last quarter?
- What is driving margin movement?
- What guidance signals matter most for the next two quarters?

_Generated at 2026-02-05 17:11 UTC. Not financial advice._

If you see missing event dates, expand the calendar window. If you see missing expectations, confirm that estimates are enabled and available for those symbols. If you hit request limits, reduce the batch size or add caching. The Basic plan call limit is published in FMP’s plan comparison.
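For the caching suggestion, a small file-based cache is often enough. The sketch below is one possible approach rather than part of the project code: `cached_call`, the cache directory, and the one-hour TTL are illustrative names and values I have assumed. It wraps any fetch function and reuses a stored JSON response while it is still fresh.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Callable, Dict


def cached_call(
    cache_dir: Path,
    endpoint: str,
    params: Dict[str, Any],
    fetch: Callable[[], Any],
    ttl_seconds: float = 3600.0,
) -> Any:
    """Return a fresh cached JSON response, otherwise call fetch() and store the result."""
    # Build a stable file name from the endpoint and parameters (no API key involved).
    key_src = json.dumps({"endpoint": endpoint, "params": params}, sort_keys=True)
    key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()[:24]
    path = cache_dir / f"{key}.json"

    if path.exists() and (time.time() - path.stat().st_mtime) < ttl_seconds:
        return json.loads(path.read_text(encoding="utf-8"))

    data = fetch()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


# Demo with a fake fetcher that counts how often it is called.
calls = {"n": 0}


def fake_fetch() -> Any:
    calls["n"] += 1
    return [{"symbol": "SHOP", "sector": "Technology"}]


demo_dir = Path(tempfile.mkdtemp())
first = cached_call(demo_dir, "profile", {"symbol": "SHOP"}, fake_fetch)
second = cached_call(demo_dir, "profile", {"symbol": "SHOP"}, fake_fetch)
print(first == second, calls["n"])
```

In the engine, `fetch` could be a lambda that wraps the existing `client.get_json(...)` call, so repeated runs during the same hour do not spend additional requests.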

That’s all you need to know on how to build an Earnings Briefing Engine using the FMP API.

Conclusion

In this article, we have learned how to build an earnings briefing engine that reduces manual effort during earnings weeks by enforcing a repeatable workflow.

Using Financial Modeling Prep (FMP) as the primary data source, the process relies on a stable API to retrieve earnings events and selected supporting context, then summarizes the results into a standardized one-page briefing format that can be stored and reused.

In practice, this system will be useful for maintaining a disciplined pre-earnings routine, supporting watchlist management during busy reporting weeks, and creating a written record of what to review before each announcement.

I hope it has helped!

NBD Focus Map (Free PDF)

Cornellius Yudha Wijaya — Sun, 01 Feb 2026 12:36:35 GMT

Most people do not struggle because they lack effort. They struggle because they learn without a plan.

The Focus Map is my way of turning Non-Brand Data into a simple path you can follow. Pick one track, stick with it for 2–4 weeks, and ship one mini project at the end.

What you’ll get

SQL track for real analysis work
Python + ML track for practical modelling
RAG track for building question-answering systems on documents

Each track includes:

5 posts to read in order
A weekly cadence (3 sessions/week, 60 minutes/session)
One mini project with clear deliverables
What “good” looks like, so you know when to move on

Note: the SQL Crash Course is a collaboration with Josep Ferrer (DataBites), so a few lessons are open on databites.tech.

Download the PDF below for the NBD Focus Map.

Nbd Focus Map

124KB ∙ PDF file

Download

If you'd like, reply and let me know which track you’re starting with.

After you finish a track

If you finish one track and you have a notebook, repo, or write-up, you are already ahead of most people.

The next step is making it sharper and more reusable.

1) Turn it into a portfolio-ready artifact

I’m packaging a Portfolio Rubric Toolkit to help you score your project, spot what is missing, and decide what to fix first.

Portfolio Rubric Toolkit (upgrade your project in 30–60 minutes):

Portfolio Rubric Kit

2) Keep momentum with guided paths and templates

If you prefer a more structured approach, the paid tier is built around member-only deep dives, reusable templates, and guided paths through the archive.

The Portfolio Rubric Data Science Hiring Managers Use

Cornellius Yudha Wijaya — Tue, 20 Jan 2026 10:42:48 GMT

Image generated with ideogram.ai

Picture this scenario: You spend weeks polishing a Kaggle competition notebook with immaculate code, fancy plots, and a near-perfect model. You feel confident. Then, in an interview, the hiring manager asks, “How would this work with messy real data? Where is the business decision here?” You scramble for an answer. Awkward silence. The truth is that Kaggle taught you how to compete, not how to solve real business problems. In fact, “most recruiters don’t care about your Kaggle rank”. They care about something else entirely.

Too many data science portfolios list projects that impress on the surface but fail to deliver real value. Hiring managers who have screened dozens of candidates notice a persistent gap between what candidates showcase and what teams actually need. The typical portfolio is just a list of projects, but what they are looking for is evidence of impact, realism, and critical thinking behind them.

The good news is that you don’t need a dozen fancy projects to stand out. You need the right qualities in whichever projects you present. Below, I’ll share the rubric hiring managers use to evaluate data science portfolios, which is a simple scoring framework covering the five areas that matter for hiring.

We’ll also look at common mistakes to avoid and quick fixes to upgrade your existing portfolio. By the end, you’ll understand exactly what hiring managers are looking for and how to demonstrate it in your portfolio.

Let’s get into it.

Common Portfolio Mistakes (What to Avoid)

Even experienced data scientists fall into some classic portfolio traps. Before we discuss what to do right, let’s highlight what not to do. Here are some common mistakes that cause portfolios to miss the mark:

Using Only Toy or Overused Datasets: Relying on Titanic survival predictions or Iris classification projects shows a lack of originality. Recruiters have seen these portfolios thousands of times, and a collection full of such washed-out projects will bore them. It also indicates you haven’t worked with realistic data. An industry insider said, “I hate seeing people use common Kaggle datasets like Titanic or Iris. Instead, try to scrape your own data or find unique sources.” Overall, if your data is pre-cleaned and common, it doesn’t demonstrate your ability to handle real-world data quirks.
No Clear Problem or Purpose: Failing to define a business question or real-world purpose is a common mistake. A portfolio project like “I built a neural network to classify images” without context won’t impress hiring managers. They want to know why you did it, and whether it solves a meaningful problem or was just a class assignment. If you can’t explain the problem and its significance, it shows a lack of business thinking. Many portfolios fail not due to technical skill but because they don’t communicate value. Avoid projects without a narrative of who benefits or what decisions can be made. For example, don’t say “it was a bootcamp group project” when asked why you chose it. Instead, show that you addressed a problem you care about or an issue relevant to a business.
Metrics Over Impact (Model-Centric Thinking): Many candidates focus on achieving 99% accuracy in a model and present that as the victory, but hiring managers are wary of this. Focusing on metrics instead of business value is a mistake. For example, a churn prediction model with an AUC of 94% sounds good but has little value if it mostly flags customers who no longer use the product. A narrow focus on metrics often means ignoring whether the solution solves the core problem. Employers want you to deliver value, so don’t just brag about high scores but show you understand the “so what?” of your results.
Ignoring Deployment and Next Steps: A common mistake is treating projects as standalone exercises. Creating a model isn’t enough; its value lies in deployment and usage. If your projects don’t mention how the model would be implemented or used, or what the next steps are after building it, hiring managers notice. Most employers won’t consider you a serious candidate for a senior role without knowledge of deployment, retraining, or monitoring. You don’t need to be an MLOps expert, but showing deployment ideas (even hypothetical) is crucial.
Poor Presentation and Communication: Many portfolios are hard to read, lacking README files, commentary, or visualizations, making it tiring for reviewers to understand your project. A hiring manager said, “I hate seeing a big mess of code with no README or TL;DR.” Without a clear summary or visual results, your work can be overlooked. Hiring managers glance through dozens of portfolios, so if yours doesn’t quickly highlight key points, it likely won’t hold attention. Another manager said, “I ignore side projects unless they show real impact... I need impact, not just some model.” Showing impact also means presenting insights simply—pictures or charts often communicate more effectively than words. Portfolios without an executive summary, well-designed graphs, or an organized story are at a disadvantage.

Avoid these pitfalls:

Steer clear of overly common projects,
Always define the problem and the value,
Think beyond accuracy alone,
Consider real-world deployment,
Present your work clearly.

Next, we’ll discuss exactly what hiring managers are looking for instead and how to ensure your portfolio checks those boxes.

What Hiring Managers Are Actually Looking For

So what does impress a hiring manager in a data science portfolio? In a word: impact.

They want to see proof that you can apply data science to solve real problems, not just toy exercises. From my experience, this boils down to a few key qualities. Specifically, they evaluate portfolios across five dimensions that map to real on-the-job success:

Problem Framing: Did you clearly define the problem you tackled and why it matters? Great portfolios start with a well-scoped question or business problem, not just a technique. (Is it a meaningful, non-trivial problem, and do you understand the context around it?)
Data Realism: Did you use data that’s reflective of real-world complexity? This includes working with messy or authentic datasets, not only pristine samples. It shows you can handle real data challenges and demonstrates curiosity in sourcing data beyond the usual examples.
Evaluation Rigor: How do you measure success, and how trustworthy are your results? We look for the use of proper metrics, baseline comparisons, validation techniques, and an honest assessment of model performance. In short, are you skeptical about metrics and careful about conclusions, or are you just accepting whatever accuracy pops out?
Deployment Thinking: Did you consider what happens after the model is built? That means thinking about how the solution could be deployed or used in production. For example, packaging the model, building an API, or simply discussing how a business could implement your insights. This shows a “product readiness” mindset, not just academic analysis.
Communication: Could someone who isn’t you understand and appreciate the project quickly? This covers the clarity of your writing, visualization of results, and overall storytelling. Great portfolios read almost like case studies: they draw the reader in, highlight key findings, and explain technical details in an accessible way. In fact, storytelling and clear communication are becoming increasingly important. Companies want data scientists who can clearly explain insights, not just write code.

These five categories form the Portfolio Rubric that many hiring managers use to score a portfolio. Think of each as a lens through which your project is evaluated. If your portfolio projects excel in these areas, you’re demonstrating the qualities that truly matter on the job.

In the next sections, we’ll break down each rubric category in detail. For each category, I’ll explain why it matters in real-world terms and what distinguishes an average project from an outstanding one. I’ll even provide sample scoring criteria so you can gauge where your projects might fall.

Let’s dive into the rubric that can make your portfolio a hiring manager’s dream.

The Portfolio Rubric: 5 Key Evaluation Categories

1. Problem Framing

Problem framing is about setting the stage. It’s answering: “What exact problem are you solving, and why does it matter?” A strong portfolio project doesn’t start with “I used X algorithm”; it starts with a clear question or objective. For example, instead of “I built a time series model,” good framing would be “I forecasted weekly sales to help a retailer manage inventory,” which is a specific problem with a business context.

In industry, choosing the right problem is half the battle. Companies need data scientists who focus on impactful questions, not just cool techniques. If a project lacks context, it “only shows your lack of business thinking”. Remember, a brilliant model solving an irrelevant problem is a wasted effort. Hiring managers look for whether you understood the purpose behind the project. Did you identify a stakeholder or decision-maker, and what they care about? Do you connect your results to a business outcome or insight?

For example, a candidate’s portfolio included a project “Predicting Employee Attrition.” On paper, it was a classification model with decent accuracy. But what impressed me was the framing. They introduced it as “Employee turnover prediction to inform HR retention strategies” and discussed how reducing attrition could save money. That context turned a generic model into a compelling story of business value.

How we score it (Problem Framing):

Level 1 (Needs Improvement): The project lacks a clear question or goal. It feels like a generic exercise (e.g., “I applied X algorithm to Y data” with no further context). The reader can’t tell what problem this solves or why it’s important.
Level 2 (Good): The project defines a problem, but in a somewhat generic way or without emphasizing its importance. There’s a basic problem statement (e.g., predicting house prices), but little discussion of who benefits or what one would do with this prediction. Some context is given, but it may be shallow or assumed.
Level 3 (Excellent): The project is framed around a specific, meaningful problem with real-world context. It’s immediately clear why the problem matters (e.g., “predicting equipment failure to reduce downtime costs”). The candidate explains the background and stakes: who has this problem, what decision the analysis will inform, and how success is defined. The scope is well-defined (not too broad or vague), showing the candidate knows how to translate an ambiguous idea into a concrete data question.

2. Data Realism

Data realism refers to using data and approaches that mirror real-world conditions. This means datasets that are messy, large, or obtained from authentic sources. Not just tidy CSVs everyone’s seen before. It also means demonstrating data wrangling and an understanding of data quality, rather than assuming data is perfect.

In industry, data is often messy or incomplete. Using only clean, toy datasets (like Kaggle or classroom sets) doesn’t prove you can handle real data challenges. Recruiters know anyone can run a model on Titanic or Iris; that doesn’t make you stand out. Relying on such projects may cause recruiters to ignore you, as your portfolio shows a lack of creativity. Instead, sourcing interesting datasets or demonstrating how you managed missing values, outliers, or scaling shows initiative and practical skill. A hiring manager suggests scraping your own dataset or seeking rarer datasets, rather than recycling common examples.

Imagine two candidates. Alice uses the Titanic dataset but writes as if she’s helping a cruise company improve safety, discussing the dataset's limitations (e.g., a sample of historical passengers) and how she’d gather more current data. Bob uses the Titanic dataset and just builds a classifier with 99% accuracy (on a cleaned dataset where missing ages were already handled). Alice is demonstrating data realism; Bob is not. We’re more likely to interview Alice because she’s thinking like a professional dealing with real data problems.

How we score it (Data Realism):

Level 1 (Needs Improvement): Uses only small, common datasets with no evidence of data cleaning or exploration. It appears the data was taken “as is” from a textbook or Kaggle, with no mention of missing values, anomalies, or domain specifics. No data sourcing effort is shown (the data fell into their lap). This suggests the candidate might struggle when faced with untidy real-world data.
Level 2 (Good): Uses a reasonable dataset and shows some data cleaning or feature engineering, but nothing beyond the ordinary. The dataset might still be a common one, but the project at least acknowledges data issues (e.g., “had to handle class imbalance by ...” or “combined two data sources”). There is evidence that the candidate can do basic wrangling and is aware of data limitations, though they may not have sought out truly novel data.
Level 3 (Excellent): The project uses realistic data, possibly self-collected or multi-source. The candidate may have accessed an API, scraped data, or used an open data portal to gather new data. They clearly document the data cleaning steps and challenges (e.g., handling missing data, skewed distributions, or integrating data from different sources). The approach shows creativity in data sourcing and thoroughness in preparation. It’s evident they didn’t just accept the data at face value – they explored its quality and shaped the data to fit the problem, just like one must do on real teams. This level demonstrates that the person can handle the messiness of actual business data.

3. Evaluation Rigor

Evaluation rigor means critically assessing your model’s performance and results. It’s about using the right metrics, establishing baselines, properly validating the model, and interpreting the outcomes with a skeptical eye. Rigorous evaluation answers: “How do I know my solution actually works, and how well?”

In real projects, a model is only as good as the evidence that it works for the intended purpose. Hiring managers want to see that you didn’t just run to a conclusion, but that you actually tested it. This includes simple things like comparing against a baseline (e.g., how does your model compare to a naive guess or the current solution?) and using appropriate metrics for the problem (e.g., using precision/recall for a class-imbalanced problem instead of just accuracy). It also means checking for overfitting, using cross-validation or a test set, and analyzing errors or uncertainty.
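To make the baseline point concrete, here is a small illustration in plain Python. The labels are invented toy data (10 positives out of 100) and the "model" predictions are hypothetical, not results from any real project. The sketch shows how an always-negative baseline can post high accuracy while catching nothing.

```python
from typing import Dict, List


def binary_report(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy, invented data: 100 customers, 10 of whom churn.
y_true = [1] * 10 + [0] * 90

# Naive baseline: predict "no churn" for everyone.
baseline_pred = [0] * 100

# Hypothetical model: catches 7 of the 10 churners and raises 5 false alarms.
model_pred = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 85

baseline = binary_report(y_true, baseline_pred)
model = binary_report(y_true, model_pred)
for name, report in (("baseline", baseline), ("model", model)):
    print(name, {k: round(v, 3) for k, v in report.items()})
```

On this toy data, accuracy moves only from 0.90 to 0.92, while recall goes from 0.0 to 0.7. Reporting both numbers next to the baseline is what lets a reader judge whether the model is doing anything useful.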

Portfolios that demonstrate evaluation rigor stand out. For instance, if you built a classifier, did you also provide a confusion matrix and discuss false positives versus false negatives in context? If you did time-series forecasting, did you hold out the last few months as a true future test? If you optimized a metric, did you consider whether that metric truly reflects business success? Showing such thoroughness tells hiring managers they can trust your work.

I recall a portfolio project on image classification where the candidate not only reported accuracy but also deliberately added noise to the images to test robustness and plotted how performance dropped. They also compared their CNN to a simpler logistic regression as a baseline. This thorough evaluation was a green flag, as it demonstrated scientific thinking and honesty about the model’s capabilities.

How we score it (Evaluation Rigor):

Level 1 (Needs Improvement): The project shows minimal evaluation. Perhaps only a single metric (like accuracy) is reported without context, or results are presented without validation (e.g., performance only on the training set or a cherry-picked example). There’s no baseline or benchmark mentioned. You can’t tell whether 90% accuracy is good or trivial, given the problem. No discussion of errors, assumptions, or limitations is present. This indicates a lack of critical thinking about the results.
Level 2 (Good): The project uses standard evaluation practices, e.g., a train/test split or cross-validation, and reports at least one appropriate metric on a held-out set. A baseline may be mentioned (e.g., “our model beats a random guess, which was 50%” or “improves over a simple linear model by 10%”). The candidate likely includes some error analysis or at least mentions possible improvements. However, the evaluation might still miss deeper issues (for example, reporting overall accuracy without noting that one class was often mispredicted, or not considering how an unbalanced dataset might skew the metric). Solid effort, but not deeply probing.
Level 3 (Excellent): The project demonstrates thorough evaluation, considering multiple performance metrics such as precision, recall, ROC AUC, and domain-specific metrics. It establishes a clear baseline, checks for overfitting (train vs. validation curves), uses methods such as cross-validation, performs sensitivity analysis, and tests edge cases. The candidate interprets results in context and asks whether the performance is acceptable (e.g., “A recall of 0.7 means 30% of true cases are missed. Is that acceptable in healthcare?”), and acknowledges limitations like data bias or assumptions. This rigor reflects a mindset of skepticism and decision-making focus, which we value.

4. Deployment Thinking

Deployment thinking evaluates whether you considered how the project’s solution would be used in a real-world environment. In other words, did you think beyond the notebook? This could include creating a simple web app for your model, following proper coding practices to package your project, or simply writing a paragraph on how you’d deploy and monitor the model in production.

In modern data science teams, the work doesn’t stop at insight or model training. Models often need to be integrated into products or processes. While you might not personally build the entire production pipeline, you will collaborate with engineers or hand off your work for implementation. Hiring managers, therefore, value awareness of deployment considerations. If two candidates both build a decent model, but one also sets up a Flask API or describes a plan for real-time inference, that candidate demonstrates ownership and practicality. It shows they think about reliability, data pipelines, or user impact, not just modeling.

In fact, not showing any hint of deployment or next steps can be costly. As noted earlier, employers might question how you’ll add value if “you can stick your model you-know-where if it’s not usable in production”. We test for a mindset of “production readiness,” which means you anticipate the steps needed to make your work actually run and keep running in a live setting.

Consider a portfolio project that predicts stock prices. Deployment considerations might include: “I scheduled this script to run daily and send an email alert with the latest prediction.” Or “I deployed the model as a web app using Streamlit so you can try it live.” Or even, “In a real company, I’d retrain this model weekly as new data comes in and monitor the prediction error over time to detect drift.” These elements turn a good project into a great one by showing you understand the full lifecycle of ML products.
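The idea of monitoring prediction error over time can be sketched in a few lines. This is a minimal illustration rather than a production monitor: the function name `drift_alert`, the 7-day window, the 1.5x tolerance, and the error values are all assumptions chosen for the example.

```python
from statistics import mean


def drift_alert(abs_errors, baseline_mae, window=7, tolerance=1.5):
    """Return True when the recent mean absolute error exceeds the baseline.

    abs_errors: chronological list of daily absolute prediction errors.
    baseline_mae: error level measured at validation time.
    window and tolerance are illustrative defaults, not recommendations.
    """
    if len(abs_errors) < window:
        return False  # not enough history to judge
    recent_mae = mean(abs_errors[-window:])
    return recent_mae > tolerance * baseline_mae


# Hypothetical error history: stable for a week, then degrading.
errors = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 0.9, 1.8, 2.1, 2.4, 2.2, 2.5, 2.3, 2.6]

print(drift_alert(errors[:7], baseline_mae=1.0))  # first week only
print(drift_alert(errors, baseline_mae=1.0))      # including the degraded week
```

Even a small check like this, described in a README, signals that you have thought about what happens after the model ships.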

How we score it (Deployment Thinking):

Level 1 (Needs Improvement): There’s no mention of deployment or next steps. The project ends at model evaluation. It’s as if the analysis exists in isolation. There’s no consideration of how the model could be consumed (e.g., by an application or user) or maintained. The code may be very prototype-like (hard-coded paths, not modular), suggesting it’s not ready to be used elsewhere. This suggests the candidate hasn’t considered real-world implementation.
Level 2 (Good): The project shows some awareness of deployment, though it’s minimal. Perhaps the candidate structured their code well or included instructions for running the project. They might mention in passing how the model could be used (e.g., “this model could be deployed as a REST API” or “in production we’d need to retrain periodically”). There may not be an actual deployment, but there’s at least recognition of the need. Alternatively, they might have taken a small step, such as containerizing the project or using a simple dashboard to present results. It’s a hint that they know deployment is important, even if they haven’t fully demonstrated it.
Level 3 (Excellent): The project actively incorporates deployment considerations or deliverables. The candidate might have a live demo (a web app, an interactive notebook, or a command-line tool) that others can interact with. Or they provide a link to a GitHub repo with a Dockerfile and clear instructions, showing you could actually run their solution easily. They discuss how they would handle tasks such as model monitoring, data updates, scaling, and integration with existing systems. In essence, they treat the project as a product rather than just an analysis. This aligns with what many hiring managers quietly look for, which is a sense of “ownership & reliability” in how you approach your work.

5. Communication

Communication in a portfolio context refers to how well you convey the story and results of your project to others. This includes the organization of your content, the explanations you provide (in writing or orally if presented), the visualizations you choose, and the overall storytelling of the project. Essentially, if someone (technical or not) reviews your project, do they quickly grasp the what, why, and how of it?

Data science is a team sport, and often a business-facing one. It’s not enough to have a brilliant analysis; you must also communicate insights to colleagues, managers, or clients. Hiring managers, therefore, seek evidence of strong communication skills in your portfolio. A well-documented project with clear Markdown cells, captioned charts, and a logical flow demonstrates that you can explain your work.

In practical terms, good communication in a portfolio might mean having a README summary for each project, highlighting key results upfront, and guiding the reader through your process step by step. It also means tailoring the depth of technical detail to your audience. For example, explaining technical concepts or decisions in plain language where appropriate, and using visuals to make results intuitive. A common mistake (as we saw) is to dump a lot of code or an overly complex notebook without context. Instead, present a narrative such as what problem you tackled, what the data told you, what model you built, how well it worked, and what it means.

I once reviewed a candidate’s portfolio project on customer segmentation. They included a before-and-after chart showing how their clustering grouped customers in a new way, along with a short paragraph: “Segment 3 (orange in the chart) had the highest lifetime value but low engagement. This insight suggested a targeted re-engagement campaign for this group.” That single visualization and explanation conveyed the essence of the project’s impact. Compare that to someone who might simply say, “I did K-means clustering on customers,” and dump the cluster centers without context. The former demonstrates excellent communication and understanding of the audience’s needs.

How we score it (Communication):

Level 1 (Needs Improvement): The project is difficult to follow. There’s little to no documentation or explanation. Perhaps the code is there, but the why behind the steps is not explained. Visualizations, if any, are poorly labeled or absent. There’s no clear introduction or conclusion. Essentially, only someone with the candidate’s exact knowledge could decipher the project. This raises concerns about how the person would communicate on a team or to stakeholders.
Level 2 (Good): The project is understandable with some effort. The candidate provides a decent structure (e.g., sections in a notebook, some comments or markdown explaining each part). They include a couple of key plots or tables and attempt to summarize findings. However, the narrative might not be as tight or engaging as it could be. Perhaps the introduction or conclusions are brief, or the visuals could be clearer. It’s adequate, but it might not fully grab a non-expert audience or highlight the most important insights upfront.
Level 3 (Excellent): The project is structured like a compelling story or case study, starting with a brief overview of the problem and approach, then explaining the methodology step-by-step in simple terms, and concluding with clear recommendations. Visuals are used effectively to support the findings, each accompanied by a descriptive title or caption. The writing is concise, with minimal jargon and no unnecessary explanation, making it accessible to both technical and business audiences. Attention to design details, such as bullet points or bold highlights, emphasizes key insights. This allows reviewers to quickly grasp the main points or explore the detailed reasoning, demonstrating that the candidate can communicate effectively across functions and deliver meaningful insights beyond just modeling. Ideally, the project is engaging, makes the reader care about the outcome, and showcases strong storytelling skills.

Those are the five rubric categories:

Problem Framing,
Data Realism,
Evaluation Rigor,
Deployment Thinking,
Communication.

Great portfolios hit high marks in all five.

Next, let’s see how you can apply this rubric to improve your own portfolio, even if you’re short on time.


Quick Fix: How to Upgrade Your Portfolio in 2 Hours

You might be thinking, “This is great for planning new projects, but what about the projects I already have?” The good news is that you can improve an existing portfolio relatively quickly by addressing the rubric criteria. Here’s a step-by-step game plan (which you can literally do in an afternoon) to level up your portfolio using the rubric:

Pick Your Best Project (Focus Your Effort): If you have many projects, identify one or two that are most relevant to the roles you want or that best showcase your skills. It’s often better to have one polished, rubric-aligned case study than five mediocre ones. Hiring managers spend maybe 2-3 minutes on an initial portfolio glance, so you want your standout work front and center.
Add a Clear Problem Statement: Open your project README or the top section of your notebook. Write a one-paragraph intro that answers: What problem are you solving and why should anyone care? Be specific and use plain language. For example, “Goal: Reduce customer churn by predicting which users are likely to cancel, so the marketing team can intervene with retention offers.” This immediately frames the project in terms of business value and hooks the reader.
Provide Context on Data: Next, describe the dataset and why it’s appropriate (or if it has limitations). If it’s a well-known dataset, acknowledge that and perhaps note how you treated it: “We use the Telco Customer Churn dataset (IBM Sample) as a proxy for a subscription business’s customer data. In a real scenario, we’d gather recent customer activity and subscription details; the sample data serves as a stand-in, which I augmented by adding some noise to simulate real-world imperfections.” If you did any data cleaning or feature engineering, summarize that process. This shows Data Realism. Even a sentence like “Note: I had to impute missing values for tenure and handle class imbalance (only ~26% churned) by oversampling” demonstrates that you dealt with data issues (and gets you points on the rubric).
Insert a Baseline and Evaluation Highlights: Scan your results section. Have you indicated what performance you’d consider good, or what you’re comparing against? If not, add a baseline. This could be as simple as “For context, if we predict ‘no churn’ for everyone, we’d get ~74% accuracy (the non-churn rate). Our model achieves 85% accuracy, significantly improving over this baseline.” Also, ensure you mention the key metric(s) and why they make sense: “We optimize for recall, to catch as many churning customers as possible, because missing a churning customer is costlier than a false alarm in this context.” This addition shows Evaluation Rigor and aligns your project with real decision-making. It can be done with just a few lines of text or an extra table comparing metrics.
Discuss Deployment (Even Hypothetically): Add a short section titled “Deployment & Next Steps” at the end. Here, write a few sentences about how this model/analysis could be used in production or what you’d do next if this were a real company project. For example: “If this model were deployed in a company, I’d set it up as a daily batch job scoring each active user. Users predicted to churn would be fed into a CRM tool for the marketing team to target. I’d also monitor the model’s precision/recall over time – if performance drifts, I’d retrain with fresh data. For real deployment, we’d need to integrate with the data warehouse and ensure predictions happen within a week of a customer’s last activity.” You don’t have to actually deploy it, but showing you understand the path to production is immensely valuable. It shows that you think like someone who wants to drive results, not just build models.
Tighten the Narrative and Presentation: Now polish the communication. Ensure your notebook or report has a logical flow: Introduction → Data → Method → Results → Conclusion. Add or refine chart titles and axis labels to be more descriptive (e.g., “Churn Rate by Tenure Group” instead of “Figure1.png”). Consider adding an illustrative plot if you haven’t (for instance, a bar chart of feature importances or a sample of predictions vs. actual outcomes). Also, write a short conclusion that reiterates the key insight or performance: “Conclusion: The model can identify ~50% of churners with 80% precision, which could significantly reduce churn if retention offers are effective. The factors of contract length and monthly charges were the strongest churn predictors, aligning with business intuition.” This helps a skimmer get the point and shows you understand the results in context. Finally, if the project is on GitHub, make sure the README highlights these points and not just the technical setup.
Apply the Same Steps to Other Projects (if time permits): If you have another project that’s relevant (say one NLP project and one computer vision project to showcase range), repeat the above steps there. But remember, quality over quantity. It’s better to fully refurbish one project than half-fix three of them. You want at least one example that scores high on all rubric dimensions.
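The baseline comparison from the steps above can be checked with a few lines of plain Python. This is a toy sketch: the labels are synthetic and built to mirror the ~26% churn rate and 85% accuracy quoted in the example text, and the helper names `accuracy` and `recall` are my own, not from any library.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model caught."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Synthetic labels: 26 churners (1) and 74 non-churners (0).
y_true = [1] * 26 + [0] * 74

# Baseline: always predict "no churn".
baseline_pred = [0] * 100

# Hypothetical model: catches 18 of the 26 churners, raises 7 false alarms.
model_pred = [1] * 18 + [0] * 8 + [1] * 7 + [0] * 67

print("baseline accuracy:", accuracy(y_true, baseline_pred))  # 74 of 100 correct
print("baseline recall:  ", recall(y_true, baseline_pred))    # catches no churners
print("model accuracy:   ", accuracy(y_true, model_pred))     # 85 of 100 correct
print("model recall:     ", round(recall(y_true, model_pred), 3))
```

The baseline looks respectable on accuracy but catches no churners at all, which is why the write-up should state both the baseline and the metric you actually optimize for.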

Within about 2 hours, using the steps above, you can transform a bland, academic project into a professional case study. The key is reframing your existing work to speak the language of hiring managers and to highlight business value.


Best Financial Data APIs in 2026

Cornellius Yudha Wijaya — Mon, 12 Jan 2026 06:47:18 GMT

Photo by Campaign Creators on Unsplash

Financial data APIs provide a direct, programmatic pathway to market information. They support a wide range of applications, including financial analytics, research workflows, automated reporting, and data-driven products. In 2026, the ecosystem is mature and competitive. Many providers offer overlapping capabilities on the surface, yet practical differences can affect implementation quality and long-term maintainability.

In practice, providers vary in their market coverage and the continuity of their historical datasets. They also differ in the depth and standardization of fundamental data, the availability of real-time or streaming access, and the constraints imposed by rate limits. The quality of documentation, integration tools, and licensing terms also influences whether an API remains usable after initial testing. Given these differences, we need to determine which financial data APIs best fit our needs.

In this article, we will review the best financial data APIs available in 2026. The objective is to present clear trade-offs rather than a single universal solution. For each provider, I summarize the types of data you can retrieve, the key advantages and disadvantages, and the contexts in which the API is appropriate.

Curious about it? Let’s get into it.


Financial Modeling Prep (FMP)

Overview

Financial Modeling Prep (FMP) is a financial data API provider that focuses on broad market coverage and practical endpoints for application development. It offers market prices and fundamental datasets through a straightforward REST interface.

Advantages

All-in-one coverage: Provides pricing data, company fundamentals, macroeconomic indicators, and market news in one place.
Rich endpoint selection: Includes many ready-to-use endpoints, reducing the need for additional data stitching.
Strong developer usability: Clear documentation and a predictable API structure make integration and iteration efficient.
Product-oriented fit: Well-suited for building stock screeners, analytics dashboards, and research pipelines that combine price and fundamental data.

Disadvantages

Limited free tier: The free plan is suitable for testing and light usage, but rate limits and reduced data depth limit its usefulness.
Advanced access requires upgrades: Certain datasets and higher-capacity usage are reserved for higher-paid tiers.

Best for

Teams or individuals who want a single API that can support both market data and fundamentals for analysis and product development.

Ideal starting plan

Start with the free tier to validate endpoints and data fit, then move to the entry-level paid tier once you need consistent throughput or deeper coverage.

Alpha Vantage

Overview

Alpha Vantage is a comprehensive financial data API platform designed for both retail investors and institutional trading systems. It provides extensive coverage across equities, options, forex, cryptocurrencies, and macroeconomic datasets, combining real-time market feeds with deep historical data and built-in analytics.

A key differentiator is that Alpha Vantage sources data from licensed exchanges such as NASDAQ and Options Price Reporting Authority (OPRA), enabling access to professional-grade market data infrastructure through a simple API interface. With millisecond-level real-time updates and more than 20 years of historical price and fundamental data, the platform supports everything from educational projects to institutional-scale algorithmic trading systems.

Advantages

Institutional-grade data licensing: Alpha Vantage is officially licensed by major market data authorities, including NASDAQ and OPRA, ensuring reliable and compliant access to equity and options data streams. This makes it suitable for professional trading environments that require high-quality exchange-sourced data.
Real-time and low-latency market data: The platform delivers millisecond-level real-time data, enabling use cases such as algorithmic trading, quantitative research, and automated portfolio monitoring where latency and accuracy are critical.
Extensive historical coverage: Alpha Vantage offers 20+ years of historical price data across global markets, along with long-range fundamental datasets. This depth allows analysts and quantitative researchers to perform robust backtesting and long-horizon market studies.
Built-in technical analysis library: The API includes a large catalogue of technical indicators that can be retrieved directly through API calls. This significantly reduces engineering overhead for traders and developers who would otherwise need to implement indicator calculations themselves.
Accessible architecture for all users: Despite its institutional capabilities, Alpha Vantage maintains a clean, developer-friendly API structure that allows beginners, independent traders, and large trading firms to integrate financial data pipelines quickly.

Disadvantages

Free tier constraints: As with other providers, certain features are not included in Alpha Vantage’s free tier, for compliance and anti-bot purposes.

Best for

Alpha Vantage is particularly well-suited for:

Retail investors and independent developers building trading tools or investment dashboards
Quantitative researchers requiring long historical datasets for backtesting
Algorithmic and institutional trading systems that need real-time exchange-licensed data feeds
Fintech platforms seeking a single API for market data, fundamentals, and analytics

EOD Historical Data (EODHD)

Overview

EOD Historical Data (EODHD) is a market data provider known for broad international exchange coverage and long historical time series. It combines end-of-day and intraday pricing with fundamentals and several optional datasets that support more advanced workflows.

Advantages

Strong global coverage with a long history: Offers broad exchange support and historical depth suitable for long-horizon analysis and backtesting.
High value on paid tiers: Paid plans are competitively priced for the amount of data provided, especially when you need global markets and deeper history.
Solid fundamentals and add-ons: Includes company fundamentals and supports additional datasets such as options and macroeconomic indicators, depending on the plan.
Practical integration options: Supports bulk-style access for efficient retrieval, provides some streaming capabilities, and offers spreadsheet-friendly integrations for Excel and Google Sheets.

Disadvantages

Evaluation-only free tier: The free tier is primarily for evaluation. Request limits are restrictive, so it is best treated as a connectivity and fit check rather than a long-term solution.
Real-time depth is uneven: Real-time availability and latency can differ by asset class and region, with stronger coverage typically in U.S. markets than in many international markets.

Best for

Projects that require global market coverage and long historical datasets, especially when you want substantial value from paid plans.

Finnhub

Overview

Finnhub is a financial data API that combines market quotes with news and event-oriented datasets. It is widely used for prototyping and product development because it offers accessible pricing and a relatively broad feature set.

Advantages

Generous free-tier limits: The free plan typically provides sufficient request capacity to support meaningful experimentation and early-stage prototypes.
Balanced dataset mix: Provides a practical combination of quotes, news, sentiment signals, and market calendars, helping build context-aware applications.
WebSocket support: Provides streaming access through WebSockets, enabling lower-latency updates without relying exclusively on polling.

Disadvantages

Shallower fundamentals: Fundamental coverage is generally less comprehensive than that of providers that focus heavily on financial statements and deep company datasets.
Paid plans for full access: Longer historical depth and specific premium endpoints are gated behind paid tiers, particularly for more advanced or higher-volume use cases.

Best for

Rapid prototyping and application development that benefits from combining price data with news, sentiment, and event calendars.

Tiingo

Overview

Tiingo is a financial data provider that emphasizes clean historical market data and straightforward API access. It is commonly used in research and backtesting workflows, particularly by individual developers and small teams.

Advantages

Substantial value for individuals: Paid plans are typically affordable given the included data and request limits, making Tiingo attractive to solo builders.
High-quality historical end-of-day data: Tiingo is well-regarded for stable, consistent EOD datasets that support backtesting and long-horizon analysis.
Practical fundamentals for U.S. equities: On paid tiers, Tiingo provides solid fundamental coverage of U.S. companies, often sufficient for screening and basic factor research.

Disadvantages

Less comprehensive as an all-in-one source: Tiingo is not primarily positioned as a single provider of macroeconomic data and commodities coverage, so you may need supplementary sources depending on your requirements.
Real-time and intraday are not the core focus: While intraday data may be available, it is not as central or as feature-complete as providers optimized for streaming or high-frequency use cases.

Best for

Individuals and small teams who want reliable historical market data for analysis and backtesting, with reasonable U.S. fundamentals on a cost-effective paid plan.

Twelve Data

Overview

Twelve Data is a market data API focused on time-series access across multiple asset classes. It is commonly used for applications that need consistent pricing endpoints for stocks, foreign exchange, and cryptocurrencies.

Advantages

Clean multi-asset time-series API: Provides a uniform way to retrieve historical and intraday price data across stocks, FX, and crypto, simplifying implementation.
Strong developer experience: Documentation is generally clear, integration is straightforward, and common workflows are well-supported.
Built-in indicators: Includes technical indicators that reduce the effort required to add analytics to a prototype or dashboard.

Disadvantages

Paid tiers may feel expensive: Pricing can be less attractive when compared with alternatives that offer broader datasets at similar cost levels.
Limited depth beyond prices: Fundamental coverage and macroeconomic datasets are typically less extensive than those from all-in-one providers.

Best for

Projects that primarily require reliable multi-asset price time series, a developer-friendly API, and convenient technical indicators.

Marketstack

Overview

Marketstack is a market data API focused on global equity pricing, with coverage across many stock exchanges. It is designed for simple retrieval of real-time and historical stock prices via a lightweight REST interface.

Advantages

Simple global stock pricing access: Works well when your primary need is equity quotes and historical prices across multiple markets, without complex endpoint structures.
Affordable entry-level paid tier: Paid plans are typically priced for basic application use cases, making them practical for small dashboards and lightweight integrations.

Disadvantages

Limited fundamentals and extended datasets: Marketstack is primarily price-oriented and offers fewer fundamentals, corporate datasets, and value-added endpoints than all-in-one providers.
No integrated FX or crypto coverage: Foreign exchange and cryptocurrency data are not included in the core product and often require separate services.

Best for

Basic applications that need straightforward global stock price data at a predictable cost, without strong requirements for fundamentals or multi-asset coverage.

Polygon.io (Massive)

Overview

Polygon.io (now positioned under the “Massive” brand) is a market data provider focused on high-performance access to U.S. market data. It is best known for low-latency delivery, streaming support, and granular datasets suitable for trading-oriented workloads.

Advantages

Strong U.S. real-time and high-frequency coverage: Well-suited for use cases that require timely quotes and detailed market activity in U.S. equities.
High performance and streaming: Provides WebSocket streaming and fast REST endpoints, which support responsive applications and real-time monitoring.
Granular historical depth: With the appropriate plan, it offers tick-level history and detailed aggregates that are valuable for advanced backtesting and microstructure analysis.

Disadvantages

U.S.-first scope: Coverage is primarily U.S.-focused, making it not the best fit for projects requiring broad global exchange coverage.
Cost scales quickly for premium access: Real-time entitlements and extensive historical depth are typically available only on higher-priced tiers, which can be more expensive than general-purpose APIs.

Best for

Trading-oriented applications that require high-performance, real-time U.S. market data and benefit from streaming and tick-level history.

Conclusion

The financial data API landscape in 2026 is strong, but there is no single provider that is universally best for every scenario. The most practical approach is to select an API that matches the breadth and reliability you need, then confirm that its rate limits, historical depth, and licensing terms align with your data use.

In 2026, here are the financial data APIs you should know:

Financial Modeling Prep (FMP): A broad, all-in-one API that combines market prices with fundamentals and additional datasets for building complete financial applications.
Alpha Vantage: An exchange-licensed API with real-time data, 20+ years of history, and built-in technical indicators, suited to everything from learning projects to institutional trading systems.
EOD Historical Data (EODHD): A strong option for global exchange coverage and long historical datasets, with solid paid-plan value and useful add-ons.
Finnhub: A developer-friendly API with generous free-tier limits and a practical mix of quotes, news, sentiment, and market calendars.
Tiingo: A cost-effective choice for clean end-of-day historical data and backtesting, with good U.S. fundamentals on paid tiers.
Twelve Data: A clean multi-asset time series API for stocks, FX, and crypto, designed for straightforward integration and indicator-driven workflows.
Marketstack: A lightweight API for global stock price data with affordable entry pricing, best for basic applications.
Polygon.io (Massive): A high-performance provider focused on real-time and high-frequency U.S. market data, including streaming and granular history.

I hope it has helped!

Batch Screening Fundamentals with Financial Modeling Prep and Streamlit

Cornellius Yudha Wijaya — Wed, 07 Jan 2026 14:16:46 GMT

Photo by Joshua Aragon on Unsplash

Batch screening matters because most real-world financial workflows are not about understanding one company. They are about narrowing down a universe. In practice, you start with a watchlist, an index, or a sector set, then ask simple questions such as: which companies have strong profitability, manageable leverage, and healthy cash generation? That first pass turns an overwhelming list of tickers into a short list you can actually research.

The challenge is that screening requires repetition. If you fetch fundamentals one company at a time, you end up rewriting the same code path for every symbol: call the endpoint, parse the JSON, extract a few fields, compute ratios, and handle missing data. Doing this manually in notebooks does not scale, and it is easy to introduce inconsistencies across analyses.

In this article, we will build a small batch screening workflow using Financial Modeling Prep’s stable fundamentals endpoints. The workflow is designed to work even on the free tier: it pulls the data and wraps it in a lightweight Streamlit UI so you can screen companies interactively and export the results for deeper analysis.

Let’s get into it!


Foundation

You can access the entire code used in this tutorial in this repository.

Batch screening is basically the process of ‘shortlisting’ in fundamental analysis. Rather than analyzing one company at a time, you begin with a list of tickers and systematically apply the same criteria: retrieve fundamentals, calculate several ratios, filter, and rank. Performing this manually in notebooks can become repetitive and may lead to inconsistencies.

A small, structured project helps you standardize the workflow, reuse parsing and ratio logic, and produce a clean output table you can export or integrate into a dashboard.

In this article, we will build a minimal batch fundamentals screener that does three things:

Fetch the latest annual fundamentals for a list of tickers
Compute simple screening metrics (for example, ROE and debt-to-equity)
Filter and display the shortlist in a lightweight Streamlit UI, with an option to export results as CSV.

This is not a complete analytics platform. It is a compact workflow you can reuse whenever you want to screen a set of companies before deeper analysis.

The Data Source

All data comes from Financial Modeling Prep’s stable API, using a single base URL:

https://financialmodelingprep.com/stable

Each function is expressed as an endpoint on this base URL, with parameters passed via query strings. For this screener, we only use a small subset of endpoints focused on company fundamentals:

Income statement (income-statement): revenue, net income, and other income statement fields
Balance sheet (balance-sheet-statement): total assets, total liabilities, and equity fields
Cash flow statement (cash-flow-statement): operating cash flow and other cash flow items

Across these endpoints, we use consistent parameters:

symbol: the ticker (e.g., AAPL)
period: annual (to keep the example simple and consistent)
limit: usually 1 for “latest snapshot” screening (you can extend later to multi-year stability checks)

These three statements are sufficient to reconstruct a basic snapshot of a company’s fundamentals and compute simple screening ratios.
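To make the endpoint-plus-query-string pattern concrete, here is a minimal sketch that only builds the request URLs and makes no network call. The helper name `build_statement_url` and the placeholder key `demo_key` are assumptions for illustration. The base URL, endpoint names, and the `symbol`, `period`, and `limit` parameters come from the description above, and the key is passed as `apikey`.

```python
from urllib.parse import urlencode

BASE_URL = "https://financialmodelingprep.com/stable"
STATEMENT_ENDPOINTS = (
    "income-statement",
    "balance-sheet-statement",
    "cash-flow-statement",
)


def build_statement_url(endpoint, symbol, api_key, period="annual", limit=1):
    """Build the URL for one statement endpoint (hypothetical helper)."""
    if endpoint not in STATEMENT_ENDPOINTS:
        raise ValueError(f"Unknown statement endpoint: {endpoint}")
    query = urlencode(
        {"symbol": symbol, "period": period, "limit": limit, "apikey": api_key}
    )
    return f"{BASE_URL}/{endpoint}?{query}"


# Print the three URLs the screener would request for one ticker.
for endpoint in STATEMENT_ENDPOINTS:
    print(build_statement_url(endpoint, "AAPL", api_key="demo_key"))
```

Keeping URL construction in one small function means the same parameters are applied to every ticker, which is the consistency batch screening depends on.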

What the Batch Screener Does

Instead of exposing REST endpoints like the previous microservice, this project produces a screening table.

Given a list of tickers, it will:

1. Pull the latest annual income statement, balance sheet, and cash flow statement for each ticker
2. Compute a few simple metrics, such as:

ROE = netIncome / totalEquity
Debt-to-Equity = totalLiabilities / totalEquity
Cash flow health using operatingCashFlow (for example, requiring it to be positive)

3. Apply thresholds to filter the universe into a shortlist
4. Display results in Streamlit and allow CSV export for follow-up analysis
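The metric and filter steps above can be sketched without any API call. This is a simplified illustration, not the project's screening module: the function names, the threshold defaults (ROE of at least 15%, debt-to-equity of at most 2.0), and the figures for the placeholder tickers are all made up for the example. Only the field names and formulas come from the list above.

```python
def safe_div(numerator, denominator):
    """Divide, returning None when a value is missing or the denominator is zero."""
    if numerator is None or denominator in (None, 0):
        return None
    return numerator / denominator


def screen(rows, min_roe=0.15, max_debt_to_equity=2.0):
    """Compute ROE and debt-to-equity per row and keep rows passing all filters."""
    shortlist = []
    for row in rows:
        roe = safe_div(row.get("netIncome"), row.get("totalEquity"))
        dte = safe_div(row.get("totalLiabilities"), row.get("totalEquity"))
        ocf = row.get("operatingCashFlow")
        if roe is None or dte is None or ocf is None:
            continue  # skip tickers with missing fundamentals
        if roe >= min_roe and dte <= max_debt_to_equity and ocf > 0:
            shortlist.append(
                {"symbol": row["symbol"], "roe": roe, "debt_to_equity": dte}
            )
    # Rank the shortlist by ROE, highest first.
    return sorted(shortlist, key=lambda r: r["roe"], reverse=True)


# Made-up fundamentals for placeholder tickers (not real companies).
rows = [
    {"symbol": "AAA", "netIncome": 20, "totalEquity": 100,
     "totalLiabilities": 150, "operatingCashFlow": 30},
    {"symbol": "BBB", "netIncome": 5, "totalEquity": 100,
     "totalLiabilities": 50, "operatingCashFlow": 10},   # ROE too low
    {"symbol": "CCC", "netIncome": 40, "totalEquity": 100,
     "totalLiabilities": 120, "operatingCashFlow": -5},  # negative cash flow
    {"symbol": "DDD", "netIncome": 30, "totalEquity": None,
     "totalLiabilities": 80, "operatingCashFlow": 25},   # missing equity
    {"symbol": "EEE", "netIncome": 30, "totalEquity": 100,
     "totalLiabilities": 80, "operatingCashFlow": 25},
]

print([r["symbol"] for r in screen(rows)])  # ['EEE', 'AAA']
```

Handling missing values in one helper keeps the filter logic readable and ensures a ticker with incomplete data is skipped rather than crashing the whole batch.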

Project Architecture

We keep the project small and modular:

fmp_batch_screening/
├─ app/
│  ├─ __init__.py
│  ├─ config.py           # loads env vars (API key + base URL)
│  ├─ bulk_client.py      # fetches statements per ticker (batch via loop)
│  ├─ screening.py        # computes ratios + applies filters
│  └─ streamlit_app.py    # Streamlit UI (inputs, sliders, table, export)
├─ requirements.txt
└─ .env.example

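At a high level, the flow is as follows: the Streamlit UI collects tickers and thresholds, bulk_client.py fetches the latest statements for each ticker, screening.py computes the ratios and applies the filters, and the UI renders the shortlist and offers CSV export.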

Building the Batch Screening

We will now build the batch screening system step by step.

Step 1: Define dependencies (`requirements.txt`)

Before writing any code, we want to lock down the project dependencies. This keeps the environment reproducible and makes it easy for anyone to install and run the screener.

Create a requirements.txt in the project root:

requests
python-dotenv
pandas
streamlit

Install them in your CLI:

pip install -r requirements.txt

Here is what each dependency does:

requests handles HTTP calls to the FMP API.
python-dotenv loads your .env file into environment variables at runtime.
pandas gives you a table structure (DataFrame) that is perfect for screening, sorting, and filtering.
streamlit lets you turn the batch workflow into a simple UI without building a full web app.

Step 2: Configure environment variables (`.env`)

Next, create a .env file in the project root. This is where you store your API key and base URL. The goal is to keep credentials out of source code and make configuration consistent across scripts.

Create .env:

FMP_API_KEY=your_fmp_api_key_here
FMP_BASE_URL=https://financialmodelingprep.com/stable

Here is what each setting does:

FMP_API_KEY will be injected into each API request as apikey=....
FMP_BASE_URL becomes the single source of truth for endpoint construction.
By using a .env, you can switch keys or URLs without touching any code.

Step 3: Centralize config in `app/config.py`

Instead of reading environment variables in every file, we centralise configuration in one place. This keeps the rest of the codebase clean and avoids duplication.

Create app/config.py:

import os
from dotenv import load_dotenv

load_dotenv()
FMP_API_KEY = os.getenv("FMP_API_KEY")
FMP_BASE_URL = os.getenv(
    "FMP_BASE_URL",
    "https://financialmodelingprep.com/stable",
).rstrip("/")
if not FMP_API_KEY:
    raise RuntimeError(
        "FMP_API_KEY is not set. Please configure it in your .env file."
    )

Let’s break down what the code above does:

load_dotenv() reads your .env file and loads all variables into the environment.
os.getenv("FMP_API_KEY") retrieves the API key for use elsewhere.
FMP_BASE_URL has a default fallback, and .rstrip("/") ensures the URL does not end with /.
This avoids issues like .../stable//income-statement when we later join paths.
The RuntimeError acts as an early “fail fast” check so you don’t waste time debugging missing configuration later.

This file becomes a shared dependency across the rest of the project.
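The two behaviours described above, failing fast and normalizing the base URL, can be sketched in isolation. The load_config function below is a hypothetical stand-in that takes a plain mapping instead of reading the real environment, so it can be run without a .env file.

```python
from typing import Mapping, Tuple

DEFAULT_BASE_URL = "https://financialmodelingprep.com/stable"


def load_config(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (api_key, base_url), or raise if the key is missing."""
    api_key = env.get("FMP_API_KEY")
    if not api_key:
        raise RuntimeError("FMP_API_KEY is not set.")
    base_url = env.get("FMP_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return api_key, base_url


# A trailing slash in the configured URL is normalized away
key, base = load_config(
    {"FMP_API_KEY": "demo", "FMP_BASE_URL": DEFAULT_BASE_URL + "/"}
)
print(f"{base}/income-statement")

# A missing key fails immediately instead of surfacing later as an HTTP error
try:
    load_config({})
except RuntimeError as exc:
    print(f"Config error: {exc}")
```

Passing the environment in as an argument is a design choice for this sketch only; the article’s config.py reads os.environ at import time, which is simpler for a small project.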

Step 4: Build the batch fundamentals fetcher (`app/bulk_client.py`)

The “batch” problem is not about one API call. It is about applying the same extraction logic consistently across many tickers. Here, we isolate all interactions with FMP into one module that:

fetches the latest annual statements for one ticker, then
loops across many tickers and builds a DataFrame.

Create app/bulk_client.py:

from typing import Any, Dict, List, Optional
import time
import requests
import pandas as pd
from app.config import FMP_API_KEY, FMP_BASE_URL

def fetch_latest_statements(symbol: str) -> Dict[str, Any]:
    """
    Fetch the latest annual income statement, balance sheet, and cash flow
    for a single symbol using stable endpoints.
    """
    symbol = symbol.upper()

    def _get(endpoint: str, extra_params: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
        params: Dict[str, Any] = {
            "symbol": symbol,
            "apikey": FMP_API_KEY,
            "period": "annual",
            "limit": 1,
        }
        if extra_params:
            params.update(extra_params)
        url = f"{FMP_BASE_URL}/{endpoint}"
        resp = requests.get(url, params=params, timeout=30)
        if not resp.ok:
            raise RuntimeError(
                f"FMP API error ({endpoint}) for {symbol}: "
                f"{resp.status_code} {resp.text[:200]}"
            )
        data = resp.json()
        # Normalize the payload so callers always receive a list of dicts
        if isinstance(data, list):
            return data
        if isinstance(data, dict):
            return [data]
        return []

    income_list = _get("income-statement")
    balance_list = _get("balance-sheet-statement")
    cashflow_list = _get("cash-flow-statement")
    income = income_list[0] if income_list else {}
    balance = balance_list[0] if balance_list else {}
    cashflow = cashflow_list[0] if cashflow_list else {}
    # Flatten the three statements into one row of the fields we screen on
    return {
        "symbol": symbol,
        "date": income.get("date") or balance.get("date") or cashflow.get("date"),
        "revenue": income.get("revenue"),
        "netIncome": income.get("netIncome"),
        "totalAssets": balance.get("totalAssets"),
        "totalLiabilities": balance.get("totalLiabilities"),
        "totalEquity": balance.get("totalStockholdersEquity") or balance.get("totalEquity"),
        "operatingCashFlow": cashflow.get("operatingCashFlow"),
    }


def fetch_fundamentals_for_symbols(
    symbols: List[str],
    sleep_seconds: float = 0.25,
) -> pd.DataFrame:
    """
    Loop over a list of tickers and fetch the latest annual statements for each.
    Returns one DataFrame row per symbol.
    """
    cleaned = [s.strip().upper() for s in symbols if s.strip()]
    cleaned = list(dict.fromkeys(cleaned))  # de-duplicate, preserve order
    rows: List[Dict[str, Any]] = []
    for sym in cleaned:
        try:
            rows.append(fetch_latest_statements(sym))
        except Exception as exc:
            # One failing ticker should not stop the whole batch
            print(f"[WARN] Failed for {sym}: {exc}")
        time.sleep(sleep_seconds)
    return pd.DataFrame(rows) if rows else pd.DataFrame()

This module has two layers: a single-symbol fetcher and a batch loop.

1) fetch_latest_statements(symbol)

Uppercases the ticker so aapl becomes AAPL.
Defines _get(endpoint, extra_params) as a local helper:
- Builds query parameters (symbol, period=annual, limit=1, plus apikey).
- Constructs the URL using the stable base: f"{FMP_BASE_URL}/{endpoint}".
- Sends a GET request with requests.get(...).
- If the API fails, it raises a clear error showing endpoint + status code + partial body.
- Normalizes responses so you always get a list of dictionaries.
Calls _get(...) three times:
income-statement
balance-sheet-statement
cash-flow-statement
Picks the first result from each list (because limit=1) and flattens only the fields we care about into a single dictionary.

That flattening step is important: instead of returning three raw JSON blobs, we return one consistent “row” suitable for a DataFrame.

2) fetch_fundamentals_for_symbols(symbols)

Cleans the input list:
- removes empty values
- uppercases everything
- de-duplicates (so you don’t waste calls)
Loops over each symbol and calls fetch_latest_statements.
If one symbol fails, it prints a warning but continues the batch. This matters in real screening because a single broken ticker should not halt the entire run.
Sleeps briefly between calls to reduce the chance of rate-limit issues.
Returns a DataFrame with one row per ticker.

At this point, you’ve already converted “many API calls” into one table you can analyze.
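The cleaning and error-isolation behaviour of the batch loop can be tried without calling the API at all. In the sketch below, run_batch and fake_fetch are hypothetical names: fake_fetch stands in for fetch_latest_statements and deliberately fails for the made-up ticker BAD.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_batch(
    symbols: List[str],
    fetch_one: Callable[[str], Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Apply fetch_one to each unique symbol; collect rows and failures."""
    cleaned = [s.strip().upper() for s in symbols if s.strip()]
    cleaned = list(dict.fromkeys(cleaned))  # de-duplicate, preserve order
    rows: List[Dict[str, Any]] = []
    failed: List[str] = []
    for sym in cleaned:
        try:
            rows.append(fetch_one(sym))
        except Exception:
            failed.append(sym)  # record the failure and keep going
    return rows, failed


def fake_fetch(symbol: str) -> Dict[str, Any]:
    """Stand-in for the real API call; 'BAD' simulates a failing ticker."""
    if symbol == "BAD":
        raise RuntimeError("simulated API error")
    return {"symbol": symbol, "netIncome": 1.0}


rows, failed = run_batch(["aapl", " msft ", "AAPL", "", "bad"], fake_fetch)
print([r["symbol"] for r in rows], failed)
```

Returning the failed symbols alongside the rows is an addition made for this sketch; the article’s version prints a warning instead, which is enough for an interactive run.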

Step 5: Compute ratios and build the screening rules (`app/screening.py`)

Raw statements are helpful, but screening is usually based on ratios. Here, we compute a minimal set of metrics from the fetched fields and apply filters to shortlist companies.

Create app/screening.py:

from typing import Dict, Any, Tuple, List
import pandas as pd

from app.bulk_client import fetch_fundamentals_for_symbols

DEFAULT_THRESHOLDS: Dict[str, Any] = {
    "min_roe": 0.15,
    "max_debt_to_equity": 0.5,
    "min_operating_cf": 0.0,
}


def load_universe_with_ratios(symbols: List[str]) -> pd.DataFrame:
    """
    Fetch fundamentals and compute:
      - ROE = netIncome / totalEquity
      - Debt-to-Equity = totalLiabilities / totalEquity
    """
    df = fetch_fundamentals_for_symbols(symbols)
    if df.empty:
        return df

    def safe_div(num, den):
        # Return None instead of raising when a value is missing or zero
        try:
            if den is None or den == 0:
                return None
            return float(num) / float(den)
        except (TypeError, ZeroDivisionError):
            return None

    df["roe"] = [
        safe_div(ni, eq) for ni, eq in zip(df.get("netIncome"), df.get("totalEquity"))
    ]
    df["debt_to_equity"] = [
        safe_div(liab, eq)
        for liab, eq in zip(df.get("totalLiabilities"), df.get("totalEquity"))
    ]
    return df


def apply_screen(
    df: pd.DataFrame,
    min_roe: float,
    max_debt_to_equity: float,
    min_operating_cf: float,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Apply thresholds and return (cleaned_data, shortlist).
    """
    required = ["symbol", "roe", "debt_to_equity", "operatingCashFlow"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        return df, pd.DataFrame()
    df_clean = df.dropna(subset=required)
    mask = (
        (df_clean["roe"] >= min_roe)
        & (df_clean["debt_to_equity"] <= max_debt_to_equity)
        & (df_clean["operatingCashFlow"] >= min_operating_cf)
    )
    shortlist = df_clean.loc[mask].copy()
    shortlist = shortlist.sort_values("roe", ascending=False)
    return df_clean, shortlist

Let’s break down what happens in the code above.

load_universe_with_ratios(symbols):
- Calls the batch client to get a fundamentals DataFrame.
- Defines safe_div() so ratio calculations do not crash when equity is missing or zero.
- Computes roe from netIncome / totalEquity and debt_to_equity from totalLiabilities / totalEquity.
- Adds those computed values as new DataFrame columns.
apply_screen(...):
- Verifies the required fields exist.
- Drops rows missing key metrics (because screening with None values is meaningless).
- Applies your filter rules (min ROE, max leverage, min operating cash flow).
- Sorts results by ROE so the strongest profitability appears at the top.

This is the “brain” of the screener: you can keep extending it with more metrics later without touching the UI.
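To see the screening rule at work without pandas or API access, here is a pure-Python sketch of the same logic applied to a list of dictionaries. The screen function and the companies AAA to DDD are hypothetical, and the numbers are invented; the thresholds default to the values in DEFAULT_THRESHOLDS.

```python
from typing import Any, Dict, List, Optional


def safe_div(num: Optional[float], den: Optional[float]) -> Optional[float]:
    """Return num / den, or None when a value is missing or den is zero."""
    if num is None or den is None or den == 0:
        return None
    return float(num) / float(den)


def screen(
    rows: List[Dict[str, Any]],
    min_roe: float = 0.15,
    max_debt_to_equity: float = 0.5,
    min_operating_cf: float = 0.0,
) -> List[Dict[str, Any]]:
    """Keep rows that satisfy all three rules, sorted by ROE descending."""
    shortlist = []
    for row in rows:
        roe = safe_div(row.get("netIncome"), row.get("totalEquity"))
        dte = safe_div(row.get("totalLiabilities"), row.get("totalEquity"))
        ocf = row.get("operatingCashFlow")
        if roe is None or dte is None or ocf is None:
            continue  # cannot screen a row with missing metrics
        if roe >= min_roe and dte <= max_debt_to_equity and ocf >= min_operating_cf:
            shortlist.append({**row, "roe": roe, "debt_to_equity": dte})
    return sorted(shortlist, key=lambda r: r["roe"], reverse=True)


# Invented data for four hypothetical companies
universe = [
    {"symbol": "BBB", "netIncome": 20, "totalEquity": 100, "totalLiabilities": 30, "operatingCashFlow": 10},
    {"symbol": "AAA", "netIncome": 30, "totalEquity": 100, "totalLiabilities": 40, "operatingCashFlow": 50},
    {"symbol": "CCC", "netIncome": 50, "totalEquity": 100, "totalLiabilities": 200, "operatingCashFlow": 10},
    {"symbol": "DDD", "netIncome": 10, "totalEquity": 0, "totalLiabilities": 5, "operatingCashFlow": 1},
]
print([r["symbol"] for r in screen(universe)])
```

CCC is rejected for leverage and DDD is skipped because its equity is zero, so only AAA and BBB remain, ordered by ROE.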

Step 6: Build the Streamlit UI (`app/streamlit_app.py`)

Now we expose the batch screener as an interactive app. The user provides the tickers and screening thresholds, then gets a shortlist table and CSV export.

Create app/streamlit_app.py:


import os
import sys

# Ensure project root (parent of "app") is on sys.path
CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
PROJECT_ROOT = os.path.dirname(CURRENT_DIR)
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

import streamlit as st
import pandas as pd

from app.screening import (
    load_universe_with_ratios,
    apply_screen,
    DEFAULT_THRESHOLDS,
)

st.set_page_config(
    page_title="FMP Fundamentals Screener",
    layout="wide",
)


@st.cache_data(show_spinner=True)
def get_universe_cached(symbols: tuple) -> pd.DataFrame:
    # symbols is a tuple here because cache_data needs hashable args
    return load_universe_with_ratios(list(symbols))


def main():
    st.title("Batch Fundamentals Screener")
    st.write(
        "This app uses only Financial Modeling Prep endpoints that are typically "
        "available on the **free plan** (annual financial statements). "
        "You provide a list of symbols, and the app fetches the latest annual "
        "income statement, balance sheet, and cash flow to compute basic ratios "
        "such as ROE and Debt-to-Equity."
    )

    # Sidebar: symbols + criteria
    st.sidebar.header("Universe & Screening Criteria")

    default_symbols = "AAPL, MSFT, GOOGL, AMZN, META, NVDA, TSLA, JPM, BAC, NFLX"

    symbols_input = st.sidebar.text_area(
        "Symbols (comma or newline separated)",
        value=default_symbols,
        help="Provide a list of tickers to screen. "
             "Example: AAPL, MSFT, GOOGL",
        height=120,
    )

    min_roe = st.sidebar.slider(
        "Minimum ROE (Net income / Equity, latest annual)",
        min_value=0.0,
        max_value=0.5,
        value=float(DEFAULT_THRESHOLDS["min_roe"]),
        step=0.01,
    )

    max_debt_to_equity = st.sidebar.slider(
        "Maximum Debt-to-Equity (Total liabilities / Equity, latest annual)",
        min_value=0.0,
        max_value=3.0,
        value=float(DEFAULT_THRESHOLDS["max_debt_to_equity"]),
        step=0.05,
    )

    min_operating_cf = st.sidebar.number_input(
        "Minimum Operating Cash Flow (latest annual, absolute)",
        value=float(DEFAULT_THRESHOLDS["min_operating_cf"]),
        step=1_000_000.0,
        format="%.0f",
        help="Set to >0 to require positive operating cash flow.",
    )

    st.sidebar.markdown("---")
    st.sidebar.write("Edit the symbols and criteria, then click **Run Screening**.")

    if st.button("Run Screening"):
        # Parse symbols
        raw = symbols_input.replace("\n", ",")
        symbols = [s.strip().upper() for s in raw.split(",") if s.strip()]

        if not symbols:
            st.warning("Please provide at least one symbol.")
            return

        try:
            df_universe = get_universe_cached(tuple(symbols))
        except Exception as e:
            st.error(f"Error fetching data from FMP: {e}")
            return

        if df_universe.empty:
            st.warning("No data returned from the financial statement endpoints.")
            return

        st.subheader("Universe Preview")
        st.write(
            f"Fetched latest annual statements for **{len(df_universe)}** symbols."
        )

        st.write("Columns available (first 20):")
        st.code(", ".join(df_universe.columns.tolist()[:20]), language="text")

        df_all, df_screened = apply_screen(
            df_universe,
            min_roe=min_roe,
            max_debt_to_equity=max_debt_to_equity,
            min_operating_cf=min_operating_cf,
        )

        if df_screened.empty:
            st.warning(
                "No companies passed the current screening rules. "
                "Try relaxing the filters or inspect the raw data."
            )
            with st.expander("Show full dataset"):
                st.dataframe(df_all)
            return

        st.subheader("Screening Results")
        st.write(
            f"Companies passing the screen: **{len(df_screened)}**. "
            "Sorted by ROE descending."
        )

        display_cols = [
            c
            for c in [
                "symbol",
                "date",
                "roe",
                "debt_to_equity",
                "operatingCashFlow",
                "revenue",
                "netIncome",
                "totalAssets",
                "totalLiabilities",
                "totalEquity",
            ]
            if c in df_screened.columns
        ]

        st.dataframe(df_screened[display_cols].reset_index(drop=True))

        # Simple bar chart of top N by ROE
        top_n = min(20, len(df_screened))
        chart_df = df_screened.head(top_n)
        if "symbol" in chart_df.columns and "roe" in chart_df.columns:
            st.subheader(f"Top {top_n} by ROE (latest annual)")
            st.bar_chart(
                chart_df.set_index("symbol")["roe"]
            )

        with st.expander("Download results as CSV"):
            csv = df_screened.to_csv(index=False)
            st.download_button(
                label="Download CSV",
                data=csv,
                file_name="screened_companies.csv",
                mime="text/csv",
            )

    else:
        st.info("Provide symbols in the sidebar and click **Run Screening**.")


if __name__ == "__main__":
    main()

Let’s break down what happens in the Streamlit UI code above.

The sys.path block ensures imports like from app.screening import ... work correctly in Streamlit (because Streamlit executes the file as a script).
The sidebar captures two things:
- a user-defined ticker list
- screening thresholds (ROE, debt-to-equity, operating cash flow)
@st.cache_data caches results for the same ticker list:
- if you adjust only the thresholds, Streamlit reuses the fetched data instead of calling the API again
When you click Run Screening, the app:
- parses tickers into a clean list
- fetches fundamentals and builds a DataFrame
- computes ratios
- applies screening rules
- renders the shortlist table and provides CSV export (plus a simple ROE chart)
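The ticker-parsing step is small enough to test on its own. The sketch below wraps the same logic in a hypothetical parse_symbols function and returns a tuple, since the cached loader in the app expects a hashable argument.

```python
from typing import Tuple


def parse_symbols(text: str) -> Tuple[str, ...]:
    """Split comma- or newline-separated tickers into a clean tuple."""
    raw = text.replace("\n", ",")
    symbols = [s.strip().upper() for s in raw.split(",") if s.strip()]
    return tuple(symbols)


parsed = parse_symbols("aapl, msft\nGOOGL,,  nvda \n")
print(parsed)

# A tuple can be hashed, so it can serve as a cache key; a list cannot
cache = {parsed: "cached fundamentals would live here"}
print(parsed in cache)
```

Empty entries from doubled commas or trailing newlines are dropped, so pasting a messy list into the sidebar still yields clean tickers.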

Step 7: Run the app

From the project root:

streamlit run app/streamlit_app.py

Once it runs, you can:

paste your own universe of tickers,
adjust thresholds,
export a shortlist for deeper analysis.

That is how we run the batch screening UI we just created. Let’s take a look at it by opening the localhost address that Streamlit prints in the terminal. If everything runs fine, you will see a screen like the one described below.

On the left side, the sidebar holds the ticker list and the screening criteria, while the right side shows the output after we run the screening.

The result is the preview of the data universe we acquired and the screening results.

As additional features, we have a chart showing ROE for the companies that pass the screening and a button to download the results as a CSV file.

That’s all you need to know to build our batch screener with FMP. You can always extend the metrics and add any additional information you need.

Conclusion

In this article, we built a lightweight batch fundamentals screener on top of Financial Modeling Prep’s stable API to analyze many companies within a single, consistent workflow.

By combining a small data-fetching layer, simple ratio calculations (such as ROE and debt-to-equity), and a Streamlit interface, we can quickly turn a list of tickers into a shortlist that is easy to review and export.

You can use this project as a starting point for larger screening pipelines and extend it over time with multi-year stability checks, additional metrics, caching, or deeper drill-down views for shortlisted companies.

Building an Open-Source Microservice for Financial Data Retrieval with Financial Modelling Prep

Cornellius Yudha Wijaya — Sat, 06 Dec 2025 05:33:15 GMT

Photo by Growtika on Unsplash

Financial data is something most companies and many individuals need. It is sought after because it is useful in many projects, such as building investment dashboards and portfolio trackers, running valuation and scenario analysis for listed companies, or training machine learning models for financial use cases. In all of these cases, the hard part is rarely “getting the data once.” The hard part is accessing the data cleanly and consistently every time you start a new project.

Financial Modeling Prep’s stable API provides a rich set of endpoints for financial fundamentals: income statements, balance sheets, cash flow statements, profiles, and more. It solves the problem of data source availability.

But there is still a hassle for developers: the APIs are relatively low-level. You have to remember the exact endpoint names, pass the proper query parameters, manage API keys in every script, and repeatedly transform the raw JSON into the handful of fields you actually need for your analysis.

This is where a small microservice comes in handy. Instead of remembering every FMP URL and parameter, we centralize that logic in one place and provide a few task-specific endpoints like “search companies,” “get snapshot,” and “get history.” This approach lets us manage the data flow easily and even customize the structure of the output.

In this article, we will build a minimal financial microservice on top of Financial Modeling Prep’s stable API. It will not replace a complete analytics platform; instead, it will provide a focused set of endpoints for any follow-up analytical process.

Let’s get into it.

Foundation

You can access the entire code used in this tutorial in this repository.

Before we move on to the technical part, we need to understand that building a microservice on top of existing APIs offers several practical benefits.

First, you reduce duplication. Transforming and cleaning the responses from FMP is implemented once, tested once, and shared across everything you build.
Second, you gain a single point of control. Configuration of API keys, error handling, rate limiting, and caching can all live in the microservice rather than being reimplemented ad hoc.
Third, you create a more approachable entry point for others on your team. For example, they can request /companies/AAPL/snapshot without needing to read the FMP documentation first.

These benefits matter most when you work as a developer or data scientist and need consistent data access across every company you analyze.

The Data Source

Let’s start building our financial microservice. We will begin by deciding which data from FMP we will use. For this project, all the data comes from Financial Modeling Prep’s stable API, where we will work with a single base URL and a consistent naming pattern using the following:

https://financialmodelingprep.com/stable

Every function is expressed as a specific endpoint on this base, with parameters passed as query string parameters.

In this microservice, we only use a small subset of what FMP offers, focusing on the core fundamentals most people need. To keep things simple, the service relies on five primary endpoints:

Company search (search-symbol): Lets you search by company name or partial ticker and returns candidates with symbols, names, exchanges, and currencies.
Company profile (profile): Returns basic information such as company name, exchange, currency, and other metadata.
Income statement (income-statement): Provides revenue, net income, and other income-statement fields over time.
Balance sheet statement (balance-sheet-statement): Provides total assets, total liabilities, and other balance sheet fields.
Cash flow statement (cash-flow-statement): Provides operating cash flow and other cash flow items.

Most of these endpoints support parameters like:

symbol which is the ticker (e.g. AAPL),
period like annual or quarterly,
limit which is the number of records you want (e.g., the last 5 years).

These data are enough to reconstruct a basic picture of a company’s fundamentals.

What the Financial Microservice does

In this project, we will develop a consistent REST API for the microservice:

GET /health: basic health check.
GET /companies/search?q=...: search companies by name/symbol.
GET /companies/{symbol}/snapshot: latest fundamentals snapshot (revenue, net income, assets, liabilities, operating cash flow, plus basic profile).
GET /companies/{symbol}/history?years=N: simple time series of revenue and net income for the last N annual periods.

These endpoints abstract away the FMP URL details, the API key management, and the raw JSON shape. The service is deliberately minimal, so it does not cover authentication, database storage, or more advanced features.

Project architecture

For the project architecture, we will follow the structure below:

fmp_microservice_financial/
├─ app/
│  ├─ __init__.py
│  ├─ main.py          # FastAPI app + routes
│  ├─ fmp_client.py    # Wrapper around FMP stable API
│  └─ schemas.py       # Pydantic models for responses
├─ requirements.txt
├─ .env.example
└─ Dockerfile

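At a high level, the microservice flow looks like this: a client calls one of the FastAPI routes in main.py, the route asks FMPClient for raw data from the FMP stable API, and the result is reshaped into the Pydantic models from schemas.py before it is returned.

[Figure: Financial microservice, high-level flow]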

Building Financial Microservice

Let’s start by filling in requirements.txt, a file that lists all the Python libraries we will use to build the financial microservice.

fastapi
uvicorn
requests
python-dotenv
pydantic

Based on the requirements, we will use FastAPI to build our endpoint and Pydantic to define the JSON output schema.

Next, we will set up the .env file to hold the environment variables used in this project. One requirement is a free FMP API key, which you can obtain from the FMP dashboard. Once you have the API key, fill the file with the following:

FMP_API_KEY=your_fmp_api_key_here
FMP_BASE_URL=https://financialmodelingprep.com/stable

With the configuration done, we will set up the microservice application.

Building the FMP Client

We will start with the client that wraps the FMP API. To keep the rest of the microservice clean, we isolate all interactions with Financial Modeling Prep in a single class called FMPClient. This class knows how to read configuration, build URLs, attach the API key, and handle errors. Everything else in the codebase just calls methods like get_income_statement("AAPL") without worrying about the underlying details.

Open app/fmp_client.py and fill it with the following code:

import os
from typing import Any, Dict, List, Optional
import requests
from dotenv import load_dotenv

load_dotenv()

FMP_API_KEY = os.getenv("FMP_API_KEY")
FMP_BASE_URL = os.getenv("FMP_BASE_URL", "https://financialmodelingprep.com/stable")

if not FMP_API_KEY:
    raise RuntimeError(
        "FMP_API_KEY is not set. Please configure it in your environment or .env file."
    )

class FMPClient:
    """
    Thin wrapper over Financial Modeling Prep stable endpoints.

    Base: https://financialmodelingprep.com/stable
    Examples:
      - /search-symbol?query=AAPL&apikey=...
      - /income-statement?symbol=AAPL&period=annual&limit=5&apikey=...
    """

    def __init__(self, api_key: str = FMP_API_KEY, base_url: str = FMP_BASE_URL) -> None:
        self.api_key = api_key
        self.base_url = base_url.rstrip("/")

    def _get(self, endpoint: str, params: Optional[Dict[str, Any]] = None) -> Any:
        """
        endpoint: e.g. 'search-symbol', 'income-statement', 'profile'
        """
        if params is None:
            params = {}
        params["apikey"] = self.api_key

        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        resp = requests.get(url, params=params, timeout=10)

        if not resp.ok:
            raise RuntimeError(
                f"FMP API error: {resp.status_code} {resp.text[:200]}"
            )
        return resp.json()

    def search_symbol(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:
        """
        https://financialmodelingprep.com/stable/search-symbol?query=...&limit=...&exchange=...
        """
        return self._get(
            "search-symbol",
            {
                "query": query,
                "limit": limit,
                # you can adjust or drop the exchange filter
                "exchange": "NASDAQ,NYSE,AMEX",
            },
        )

    def get_company_profile(self, symbol: str) -> List[Dict[str, Any]]:
        """
        https://financialmodelingprep.com/stable/profile?symbol=AAPL
        """
        return self._get(
            "profile",
            {"symbol": symbol.upper()},
        )

    def get_income_statement(
        self,
        symbol: str,
        period: str = "annual",
        limit: int = 5,
    ) -> List[Dict[str, Any]]:
        """
        https://financialmodelingprep.com/stable/income-statement?symbol=AAPL&period=annual&limit=5
        """
        return self._get(
            "income-statement",
            {
                "symbol": symbol.upper(),
                "period": period,
                "limit": limit,
            },
        )

    def get_balance_sheet(
        self,
        symbol: str,
        period: str = "annual",
        limit: int = 5,
    ) -> List[Dict[str, Any]]:
        """
        https://financialmodelingprep.com/stable/balance-sheet-statement?symbol=AAPL&period=annual&limit=5
        """
        return self._get(
            "balance-sheet-statement",
            {
                "symbol": symbol.upper(),
                "period": period,
                "limit": limit,
            },
        )

    def get_cash_flow(
        self,
        symbol: str,
        period: str = "annual",
        limit: int = 5,
    ) -> List[Dict[str, Any]]:
        """
        https://financialmodelingprep.com/stable/cash-flow-statement?symbol=AAPL&period=annual&limit=5
        """
        return self._get(
            "cash-flow-statement",
            {
                "symbol": symbol.upper(),
                "period": period,
                "limit": limit,
            },
        )

Let’s break down what happens in the code above. The first few lines are just to set up imports and load the configuration, where we specify the base URL to use for all API calls and the API key to attach.

Next, we define the FMPClient class as a thin wrapper that encapsulates how to call FMP. The api_key and base_url are initialized from the module-level variables, but can be overridden when instantiating the class. Also, base_url.rstrip("/") ensures there is no trailing slash on the base URL. This makes it safe to concatenate base_url and endpoint names without accidentally creating double slashes.

Then, we define the shared helper utility _get function, which will be used by the other functions within the FMPClient class.

def _get(self, endpoint: str, params: Optional[Dict[str, Any]] = None) -> Any:

The function accepts the endpoint name, such as “search-symbol” or “income-statement”. It also takes an optional params dictionary and ensures one crucial parameter is always present: the apikey. It then constructs the URL, sends a GET request using requests.get, raises a RuntimeError if the response is not successful, and otherwise returns resp.json(), the parsed JSON body from FMP.
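One detail worth noting is that _get adds apikey to the dictionary it receives. That is harmless here because every method passes a fresh literal, but if you ever reuse a params dictionary, copying it first avoids surprises. The sketch below shows that variant; build_request is a hypothetical helper and DEMO_KEY is a placeholder, and no HTTP call is made.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlencode


def build_request(
    base_url: str,
    endpoint: str,
    api_key: str,
    params: Optional[Dict[str, Any]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (url, query params) without mutating the caller's dict."""
    merged = dict(params or {})  # shallow copy, caller's dict stays untouched
    merged["apikey"] = api_key
    url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
    return url, merged


caller_params = {"symbol": "AAPL", "limit": 5}
url, merged = build_request(
    "https://financialmodelingprep.com/stable/",
    "/income-statement",
    "DEMO_KEY",
    caller_params,
)
print(f"{url}?{urlencode(merged)}")
print(caller_params)  # still has no apikey entry
```

Stripping slashes on both sides means the helper tolerates a base URL with a trailing slash and an endpoint with a leading one.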

The rest of the class defines small, descriptive methods for specific FMP endpoints. For example, search_symbol:

def search_symbol(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:

This method takes a free-text query and an optional limit, then calls _get with the endpoint name “search-symbol” and a parameters dictionary.

From the rest of your code, you can write:

client.search_symbol("AAPL")

And get back a list of candidate companies without worrying about URLs or query string details.

This client will centralize our configuration and error handling and provide the high-level vocabulary for our microservice.

Building the Microservice Schema

To keep the output consistent, the microservice does not expose raw JSON from FMP directly. Instead, we define a small set of Pydantic models that precisely describe the fields clients can expect from each endpoint, independent of how FMP structures its responses. We define them in schemas.py with the following code:

from typing import List, Optional
from pydantic import BaseModel, Field

class CompanySearchItem(BaseModel):
    symbol: str
    name: str
    exchange: Optional[str] = None
    currency: Optional[str] = None

class CompanySearchResponse(BaseModel):
    results: List[CompanySearchItem]

class IncomeSnapshot(BaseModel):
    revenue: Optional[float] = Field(
        None, description="Total revenue for the period"
    )
    netIncome: Optional[float] = Field(
        None, description="Net income for the period"
    )

class BalanceSheetSnapshot(BaseModel):
    totalAssets: Optional[float] = None
    totalLiabilities: Optional[float] = None

class CashFlowSnapshot(BaseModel):
    operatingCashFlow: Optional[float] = None

class CompanySnapshot(BaseModel):
    symbol: str
    name: Optional[str] = None
    currency: Optional[str] = None
    exchange: Optional[str] = None
    asOf: Optional[str] = Field(
        None, description="Financial statement date"
    )

    income: IncomeSnapshot
    balanceSheet: BalanceSheetSnapshot
    cashFlow: CashFlowSnapshot

class HistoryPoint(BaseModel):
    date: str
    revenue: Optional[float] = None
    netIncome: Optional[float] = None

class CompanyHistoryResponse(BaseModel):
    symbol: str
    points: List[HistoryPoint]

These Pydantic models define our microservice’s public interface, so it stays stable even if FMP’s response format changes. They also give us self-documenting API output (through FastAPI’s Swagger UI) and keep the microservice focused, because we decide the output structure.

You can also change the schema above as needed. What is important is that you understand what FMP returns and what result you want. These schema models are used together with the client we set up previously in the application, which lives in main.py.
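To illustrate the idea of reshaping a raw payload into a fixed output structure, here is a small sketch for the history points. It uses the standard library’s dataclasses instead of Pydantic so that it runs without extra installs; to_history is a hypothetical helper, and the payload values (including the extra eps field) are made up.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class HistoryPoint:
    """Stdlib stand-in for the Pydantic HistoryPoint model."""
    date: str
    revenue: Optional[float] = None
    netIncome: Optional[float] = None


def to_history(raw: List[Dict[str, Any]]) -> List[HistoryPoint]:
    """Keep only the exposed fields, skipping records without a date."""
    points: List[HistoryPoint] = []
    for item in raw:
        if not item.get("date"):
            continue
        points.append(
            HistoryPoint(
                date=item["date"],
                revenue=item.get("revenue"),
                netIncome=item.get("netIncome"),
            )
        )
    return points


# Invented payload shaped like a list of statement records
raw = [
    {"date": "2024-12-31", "revenue": 120.0, "netIncome": 25.0, "eps": 1.5},
    {"date": "2023-12-31", "revenue": 100.0},
    {"revenue": 90.0},
]
history = to_history(raw)
print([asdict(p) for p in history])
```

Fields the schema does not declare are dropped and missing ones become None, which is the same contract the Pydantic models give the service’s clients. Unlike Pydantic, a plain dataclass does not validate types at runtime.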

Building the Microservice Application

The main.py file is where the microservice becomes a real API that other programs can call. We can define it as follows:

from typing import List
from fastapi import Depends, FastAPI, HTTPException, Query
from fastapi.responses import JSONResponse
from app.fmp_client import FMPClient
from app.schemas import (
    CompanySearchItem,
    CompanySearchResponse,
    CompanySnapshot,
    IncomeSnapshot,
    BalanceSheetSnapshot,
    CashFlowSnapshot,
    HistoryPoint,
    CompanyHistoryResponse,
)

app = FastAPI(
    title="Company Fundamentals Microservice",
    version="0.1.0",
    description=(
        "Minimal open-source service that wraps Financial Modeling Prep "
        "stable fundamentals endpoints."
    ),
)

def get_client() -> FMPClient:
    return FMPClient()

@app.get("/health")
def health_check() -> dict:
    return {"status": "ok"}

@app.get(
    "/companies/search",
    response_model=CompanySearchResponse,
    summary="Search for companies by name or symbol",
)
def search_companies(
    q: str = Query(..., min_length=1, description="Search query"),
    limit: int = Query(10, ge=1, le=50),
    client: FMPClient = Depends(get_client),
):
    raw = client.search_symbol(q, limit=limit)
    results: List[CompanySearchItem] = []

    for item in raw:
        results.append(
            CompanySearchItem(
                symbol=item.get("symbol"),
                name=item.get("name") or item.get("companyName"),
                exchange=item.get("stockExchange"),
                currency=item.get("currency"),
            )
        )

    return CompanySearchResponse(results=results)

@app.get(
    "/companies/{symbol}/snapshot",
    response_model=CompanySnapshot,
    summary="Latest fundamentals snapshot for a given company",
)
def company_snapshot(
    symbol: str,
    client: FMPClient = Depends(get_client),
):
    profiles = client.get_company_profile(symbol)
    if not profiles:
        raise HTTPException(status_code=404, detail="Company profile not found")

    profile = profiles[0]
    name = profile.get("companyName") or profile.get("name")
    currency = profile.get("currency")
    exchange = profile.get("exchangeShortName") or profile.get("exchange")

    income_list = client.get_income_statement(symbol, period="annual", limit=1)
    balance_list = client.get_balance_sheet(symbol, period="annual", limit=1)
    cashflow_list = client.get_cash_flow(symbol, period="annual", limit=1)

    income_raw = income_list[0] if income_list else {}
    balance_raw = balance_list[0] if balance_list else {}
    cashflow_raw = cashflow_list[0] if cashflow_list else {}

    as_of = (
        income_raw.get("date")
        or balance_raw.get("date")
        or cashflow_raw.get("date")
    )

    income = IncomeSnapshot(
        revenue=income_raw.get("revenue") or income_raw.get("revenueTTM"),
        netIncome=income_raw.get("netIncome") or income_raw.get("netIncomeTTM"),
    )

    balance = BalanceSheetSnapshot(
        totalAssets=balance_raw.get("totalAssets"),
        totalLiabilities=balance_raw.get("totalLiabilities"),
    )

    cashflow = CashFlowSnapshot(
        operatingCashFlow=cashflow_raw.get("operatingCashFlow")
        or cashflow_raw.get("operatingCashFlowTTM")
    )

    snapshot = CompanySnapshot(
        symbol=str(symbol).upper(),
        name=name,
        currency=currency,
        exchange=exchange,
        asOf=as_of,
        income=income,
        balanceSheet=balance,
        cashFlow=cashflow,
    )

    return snapshot

@app.get(
    "/companies/{symbol}/history",
    response_model=CompanyHistoryResponse,
    summary="Simple revenue/net income history for charting",
)
def company_history(
    symbol: str,
    years: int = Query(5, ge=1, le=20),
    client: FMPClient = Depends(get_client),
):
    income_list = client.get_income_statement(
        symbol, period="annual", limit=years
    )

    if not income_list:
        raise HTTPException(status_code=404, detail="No income statement data found")

    points: List[HistoryPoint] = []
    for row in income_list:
        points.append(
            HistoryPoint(
                date=row.get("date"),
                revenue=row.get("revenue"),
                netIncome=row.get("netIncome"),
            )
        )

    return CompanyHistoryResponse(symbol=str(symbol).upper(), points=points)

@app.exception_handler(RuntimeError)
def runtime_error_handler(request, exc: RuntimeError):
    return JSONResponse(
        status_code=502,
        content={"detail": str(exc)},
    )

Let’s break down what happens in the code above.

First, we initialize the FastAPI application with metadata (title, version, and description), which is used in the auto-generated Swagger UI at /docs.

Next, we define the get_client function, a dependency that tells FastAPI how to create an FMPClient when an endpoint needs one.

def get_client() -> FMPClient:
    return FMPClient()

Later, in each route, you will see:

client: FMPClient = Depends(get_client)

This keeps client construction in one place and makes it easy to swap in a mock client for testing.
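To make the testing point concrete, here is a minimal sketch of a fake client. FakeFMPClient and build_history_points are hypothetical names introduced for illustration; only the get_income_statement call with its period and limit arguments comes from the code above, and the rows are made-up numbers.

```python
from typing import Any, Dict, List


class FakeFMPClient:
    """Hypothetical stand-in that returns canned rows instead of calling FMP."""

    def get_income_statement(
        self, symbol: str, period: str = "annual", limit: int = 5
    ) -> List[Dict[str, Any]]:
        # Made-up rows shaped like the fields the history route reads.
        rows = [
            {"date": "2024-12-31", "revenue": 120.0, "netIncome": 30.0},
            {"date": "2023-12-31", "revenue": 100.0, "netIncome": 20.0},
        ]
        return rows[:limit]


def build_history_points(client: Any, symbol: str, years: int) -> List[Dict[str, Any]]:
    """Mirror of the loop in company_history, using plain dicts instead of Pydantic."""
    rows = client.get_income_statement(symbol, period="annual", limit=years)
    return [
        {
            "date": row.get("date"),
            "revenue": row.get("revenue"),
            "netIncome": row.get("netIncome"),
        }
        for row in rows
    ]


# The logic runs without any network access or API key.
points = build_history_points(FakeFMPClient(), "TEST", years=1)
print(points)
```

In the real application, FastAPI lets you apply the same idea without touching the routes: setting app.dependency_overrides[get_client] = FakeFMPClient in the test setup makes every route receive the fake client.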

With the application created, we set up the endpoint routes. Each endpoint returns different information. For example, the /companies/{symbol}/snapshot route returns the company’s fundamental information:

@app.get(
    "/companies/{symbol}/snapshot",
    response_model=CompanySnapshot,
    summary="Latest fundamentals snapshot for a given company",
)
def company_snapshot(
    symbol: str,
    client: FMPClient = Depends(get_client),
):

The endpoint performs five steps:

Fetch basic profile
Fetch the latest financial statements
Determine the “as of” date
Build the snapshot components
Assemble the CompanySnapshot

The endpoint returns this CompanySnapshot. FastAPI serializes it to JSON and automatically documents it.
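One detail of the snapshot code is worth knowing: the `or` fallbacks treat any falsy value, including a reported figure of exactly 0, as missing. The sketch below shows the difference with a hypothetical helper, first_present, and a made-up payload; neither is part of the service above.

```python
from typing import Any, Mapping, Optional


def first_present(raw: Mapping[str, Any], *keys: str) -> Optional[Any]:
    """Return the value of the first key that exists and is not None."""
    for key in keys:
        value = raw.get(key)
        if value is not None:
            return value
    return None


# Hypothetical payload: a reported revenue of exactly 0 plus a TTM figure.
income_raw = {"revenue": 0.0, "revenueTTM": 125.0}

with_or = income_raw.get("revenue") or income_raw.get("revenueTTM")
with_helper = first_present(income_raw, "revenue", "revenueTTM")

print(with_or)      # falls through to the TTM value because 0.0 is falsy
print(with_helper)  # keeps the reported 0.0
```

For most companies the two approaches agree, so whether the stricter check is worth it depends on how much you care about zero-valued fields.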

Running the Microservice

With the application in place, let’s test the microservice. We can do that by running the following command in the CLI:

uvicorn app.main:app --reload

If it runs correctly, you should see output like the following in your CLI:

INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [9084] using WatchFiles
INFO:     Started server process [27492]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Let’s check the microservice we just created. Because FastAPI generates the documentation automatically, we can access it at the following URI in the browser:

http://localhost:8000/docs

Access the URI above, and you will see our microservice documentation below:

Try to check out one of the endpoints, for example, the /health endpoint:

We can see that the endpoint executes correctly and returns the expected response.

Let’s try out the other endpoint, such as /companies/{symbol}/snapshot to acquire the company’s financial fundamentals:

From the image above, we can see that the microservice successfully accesses multiple FMP endpoints and provides the concise output necessary for our work.

Microservice Containerization

Lastly, we will containerize our microservice. So far, we have a working microservice that runs locally. That’s fine for development, but as soon as we want to share the service with someone else or deploy it somewhere other than our laptop, we will run into dependency issues.

Containerizing the service with Docker provides a self-contained, reproducible environment that anyone with Docker can run, regardless of their local setup.

To containerize the service, you first need to install Docker Desktop. Then, fill the Dockerfile with the following code:

FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /code

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY app ./app

EXPOSE 8000

# Run the FastAPI app with uvicorn
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Next, we will build the Docker image with the following command:

docker build -t microservice-financial-service .

The build produces a reusable image that we can run and share with others. Assuming your .env file is filled in appropriately, we can run the container with the following command:

docker run --env-file .env -p 8000:8000 microservice-financial-service

Then, visit http://localhost:8000/docs once more to access the microservice documentation.

With the microservice running in the container, we can test it from a Jupyter Notebook with the following code:

import requests

BASE_URL = "http://127.0.0.1:8000"
symbol = "AAPL"

response = requests.get(f"{BASE_URL}/companies/{symbol}/snapshot")
print("Status:", response.status_code)
snapshot = response.json()
snapshot

The output result looks like this:

Status: 200
{'symbol': 'AAPL',
 'name': 'Apple Inc.',
 'currency': 'USD',
 'exchange': 'NASDAQ',
 'asOf': '2025-09-27',
 'income': {'revenue': 416161000000.0, 'netIncome': 112010000000.0},
 'balanceSheet': {'totalAssets': 359241000000.0,
  'totalLiabilities': 285508000000.0},
 'cashFlow': {'operatingCashFlow': 111482000000.0}}
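As one possible follow-up, the snapshot can feed simple ratio calculations. The sketch below hardcodes the figures from the output shown here so it runs without the service; safe_ratio is a hypothetical helper, and the two ratios are only examples of what you might compute.

```python
from typing import Optional

# Figures copied from the snapshot output; in a notebook you would use response.json().
snapshot = {
    "symbol": "AAPL",
    "income": {"revenue": 416161000000.0, "netIncome": 112010000000.0},
    "balanceSheet": {"totalAssets": 359241000000.0, "totalLiabilities": 285508000000.0},
}


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Divide two optional figures, returning None when either side is unusable."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator


net_margin = safe_ratio(snapshot["income"]["netIncome"], snapshot["income"]["revenue"])
liabilities_to_assets = safe_ratio(
    snapshot["balanceSheet"]["totalLiabilities"],
    snapshot["balanceSheet"]["totalAssets"],
)

print(f"Net margin: {net_margin:.1%}")
print(f"Liabilities to assets: {liabilities_to_assets:.1%}")
```

The None handling matters because every numeric field in the schemas is Optional, so a missing figure should produce a missing ratio rather than an exception.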

Overall, our financial microservice built on FMP works well and is ready for any follow-up work.

Conclusion

In this article, we have turned Financial Modeling Prep’s stable API into a small and reusable microservice that better meets our company’s needs than the raw endpoints.

By wrapping core functions such as search, snapshot, and history in FastAPI, Pydantic schemas, and a lightweight Docker image, we now have a straightforward, well-defined interface for our data acquisition.

You can use this as a drop-in data layer for notebooks, dashboards, or internal tools, and expand it over time with new endpoints, caching, or authentication as your use cases develop.

Introduction to Open‑Source Image Generation Models: A Beginner’s Guide

Cornellius Yudha Wijaya — Sun, 09 Nov 2025 12:47:29 GMT

Image by Author | Ideogram.ai

Introduction

Open‑source image generation models are AI tools that create pictures based on text descriptions, and they are freely available for anyone to use or modify. In simple terms, you can type in a prompt (for example, “a medieval knight on a horse at sunset”), and the model will generate an image matching that description.

These models rose to prominence around 2022, when AI image generators went mainstream, first with OpenAI’s proprietary DALL‑E 2 and soon after with the open-source Stable Diffusion model released by Stability AI.

Unlike closed systems (such as Midjourney or DALL‑E, which you can only access via paid services or APIs), open-source models have no paywalls or strict usage rules, allowing anyone to run them locally or in the cloud without the typical costs or restrictions of proprietary software.

In this article, we will explore Open‑Source Image Generation Models further and how you can navigate them.

Let’s get into it.

Key Advantages

Open-source image generation models are powerful AI art tools that put creative control directly in the users’ hands, free of charge and open for customization by the community.

There are many advantages to using the open-source image generation models, including:

Cost Efficiency: These models are available without licensing fees or subscription costs. You can run them on your own hardware or affordable cloud instances, avoiding the pay-per-image charges of some commercial services. In short, aside from hardware or electricity, generating images with an open model is practically free.
Flexibility & Customization: Since the code and weights are open, you have the freedom to customize the model to suit your needs. You can adjust parameters, change the model’s code, or even fine-tune it on your own images to create a specific style. This allows developers or artists to build the tool according to their vision rather than being limited to a generic service. For example, developers have made custom versions of Stable Diffusion for medical imaging, anime art, interior design, and more – all made possible by the flexible open license.
Transparency (Trust & Understanding): Open-source models enable anyone to see how they work internally. The model’s architecture and training data can be scrutinized for biases or problems, which helps build trust. There’s no hidden "secret sauce" behind closed doors, as researchers and users can review the model’s behavior and make sure it isn’t doing anything harmful. This openness also encourages learning; students and engineers can study actual, cutting-edge model code to improve their understanding of AI.
Community-Driven Innovation: A vibrant community surrounds these models, leading to rapid updates and contributions worldwide. Developers share features, improvements, and fixes, allowing open models to advance faster than proprietary ones. For example, the Stable Diffusion community has developed a broad ecosystem of plugins, enhancements, and fine-tuned checkpoints. Many community-trained versions are available online for various aesthetics or tasks. This collaborative environment means that if you face a problem or seek a new feature, a solution is likely already available or in progress.
No Hard Usage Limits: Unlike some proprietary tools that may limit the number of images you can generate or impose content restrictions, open-source tools allow you to generate as many as your hardware can support. There’s no rate limiting or mandatory censorship built into the model itself.
Educational Value: Open models are a great resource for education and research. Students, researchers, or anyone interested can experiment with them to learn about AI image creation. Since everything is accessible, you can observe how modifying the code or training data influences the results, which is very helpful for understanding machine learning. This open access speeds up progress in both academia and industry in generative AI.

These are the benefits you can expect from using the open-source image generation model. However, there are still challenges that come with using these models.

Disadvantages and Challenges

Despite their many benefits, open-source image generation models also present some challenges and drawbacks that users should consider:

High Hardware Requirements: Running advanced image models requires a powerful computer, ideally a modern GPU with ample VRAM. Generating high-resolution or multiple images can be resource-intensive, making it difficult for basic laptops or phones to run models like Stable Diffusion locally. Users may need hardware upgrades or cloud services for good performance. (For example, generating a 512×512 image typically needs a GPU with 4–8 GB VRAM and can take several seconds.)
Technical Complexity: The open-source community aims to make these tools user-friendly, but they aren’t always plug-and-play. Setting up and running a model might involve working with Python environments, drivers, and command-line interfaces, which can intimidate beginners. Popular user interfaces pack in many features, which can overwhelm new users. Using open models fully often requires technical knowledge, and troubleshooting issues like installation errors or GPU incompatibilities is part of learning. Advanced features like training custom models or chaining multiple models need even more expertise.
Quality Limitations and Trade-offs: Open models can produce impressive results but aren't perfect, sometimes generating artifacts or errors like distorted hands or text. Outputs vary, so you may need to adjust prompts or settings. While proprietary models like Midjourney are optimized for specific styles, open models may require extra tuning. Sometimes the images look great but lack logical consistency, as models mimic patterns without understanding scenes. Expect trial and error before you reach the desired quality.
Ethical Concerns (Bias and Misuse): Open models learn from large datasets that can contain biases, leading to skewed representations, especially if certain demographics are overrepresented. They lack filters to prevent harmful content, raising ethical concerns about misuse, such as generating violent or misleading images. While open-source freedom enables innovation, it also allows malicious use, creating a double-edged sword.
Legal and Copyright Questions: There are debates about the legality of images from these models, as their training data often includes copyrighted images scraped from the web without permission. This has led to lawsuits and to uncertainty over infringement when outputs closely mimic styles or images. Commercial use of AI art might face legal issues until laws are updated. Unlike proprietary services that ban generating images of real people or copyrighted characters, open models can do whatever is asked, risking legal trouble if used improperly. It’s important to stay informed about legal changes and use the technology ethically.

These are the challenges and disadvantages we can encounter if we are using the open-source image generation model.

How Does an Open-Source Image Generation Model Work?

Under the hood, most modern open-source image generators use a process called diffusion to create images. In simple terms, the model starts with a field of random noise and gradually refines it into a coherent picture that matches your prompt.

Diffusion models are a type of generative model: an AI algorithm designed to generate new data that resembles the data it was trained on. For image generation, this means creating new images based on the input given.

For diffusion models, the process differs from traditional methods, as it involves adding and then removing noise from the data. Essentially, the model modifies the images and refines them to generate the final output. Think of it as a denoising process where the model learns to remove noise from images.

The diffusion model was originally introduced in the paper 'Deep Unsupervised Learning using Nonequilibrium Thermodynamics' by Sohl-Dickstein et al. (2015). It describes converting data into noise via a controlled forward diffusion process and training a model to reverse this process, reconstructing the data through denoising.

Building on this foundation, Ho et al. (2020) in their paper "Denoising Diffusion Probabilistic Models" introduced the modern diffusion framework, capable of generating high-quality images and surpassing earlier popular models such as Generative Adversarial Networks (GANs). Typically, the diffusion model involves two essential stages:

Forward (diffusion) process: Data is progressively corrupted by noise addition until it appears as random static.
Reverse (denoising) process: Involves training a neural network to gradually eliminate noise and learn to reconstruct image data starting from pure randomness.

In latent diffusion models such as Stable Diffusion, these steps are performed in latent space using a variational autoencoder (VAE): the model denoises compact latent representations and then decodes them back to pixels. Let’s now examine the components of the diffusion model more closely to make this concrete.

Forward Process

The forward process is the first phase, where the images are systematically degraded by noise until they become random static.

The forward process is controlled and iterative, which we can summarize in the following steps:

Begin with an image from the dataset.
Add a small amount of noise to the image.
Repeat this process many times, possibly hundreds or thousands of times, each time further corrupting the image.
After enough steps, the original image will become just pure noise.

The process described above is often represented mathematically as a Markov chain because each noisy version depends only on the one right before it, not on the full sequence of steps.
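For readers who want the notation, the forward step is commonly written as follows in the formulation of Ho et al. (2020). Here \beta_t is the noise variance added at step t, x_0 is the clean image, and x_t is its noisy version after t steps. The second expression is the well-known closed form that jumps from x_0 to any step directly.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\, I\right),
\qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)
```

The first line is the Markov property in symbols: the distribution of x_t depends only on x_{t-1}.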

Why do we gradually turn the image into noise instead of doing it all at once? Our goal in the forward process is to help the model learn to reverse the corruption step by step. Using gradual steps allows the model to learn how to go from noisy data to clearer data. This method helps the model rebuild the image by learning little by little through the process of adding noise.

A noise schedule determines how much noise is added at each step. For example, a linear schedule increases the noise variance at a constant rate, while a cosine schedule adds noise more slowly at first and preserves useful image features for longer.
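To see the difference between the two schedules, the sketch below computes how much of the original signal survives at each step (the cumulative product usually written as alpha-bar). The defaults are assumptions taken from common practice rather than from this article: a linear range of 1e-4 to 0.02 over 1000 steps, and the cosine formula with offset s = 0.008 from the improved DDPM literature.

```python
import math
from typing import List


def linear_alpha_bar(
    num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Fraction of signal remaining after each step of a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> List[float]:
    """Fraction of signal remaining after each step of a cosine schedule."""

    def f(t: int) -> float:
        return math.cos(((t / num_steps) + s) / (1 + s) * math.pi / 2) ** 2

    return [f(t) / f(0) for t in range(1, num_steps + 1)]


linear = linear_alpha_bar(1000)
cosine = cosine_alpha_bar(1000)

# Print the remaining signal fraction at a few steps for both schedules.
for step in (1, 250, 500, 750, 1000):
    print(step, round(linear[step - 1], 4), round(cosine[step - 1], 4))
```

With these settings, the cosine schedule keeps noticeably more of the image around the middle of the process, which is the behavior described above.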

That’s a quick summary of the Forward Process. Let’s explore the Reverse Process further.

Reverse Process

After the forward process comes the reverse process, which turns the model into a generator that learns to convert noise into image data. Through small, iterative adjustments, the model can generate new, previously nonexistent images.

In general, the reverse process mirrors the forward process and follows these steps:

Begin with pure noise, which is an entirely random image made up of Gaussian noise.
Iteratively remove noise with a trained model that simulates reversing each forward step. In every iteration, the model receives the current noisy image and its timestep, then predicts how to lower the noise level based on what it learned during training.
Gradually, the image becomes clearer, resulting in usable image data.

This reverse process depends on a well-trained model that can effectively denoise noisy images. Diffusion models typically employ a neural network architecture like a U-Net, a convolutional encoder–decoder with skip connections between matching encoder and decoder layers. During training, the model learns to predict the noise added in the forward process. At each step, it also takes the timestep into account, enabling it to modify its predictions according to the noise level.

The model is usually trained with a loss function like mean squared error (MSE), which measures the difference between predicted and actual noise. By reducing this loss across many examples, the model gradually becomes skilled at reversing the diffusion process.
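In symbols, the simplified training objective from Ho et al. (2020) is usually written as below. Here \epsilon is the Gaussian noise actually added, \epsilon_\theta is the network's prediction, x_0 is a clean training image, t is a randomly chosen timestep, and \bar\alpha_t is the fraction of signal left at step t.

```latex
L_{\text{simple}}(\theta) =
\mathbb{E}_{t,\,x_0,\,\epsilon}
\left[
  \left\lVert
    \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)
  \right\rVert^2
\right]
```

The argument passed to the network is simply the noisy image x_t, built from the clean image and the sampled noise in one shot.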

Compared to options like Generative Adversarial Networks (GANs), diffusion models provide greater stability and a simpler generative process. The step-by-step denoising method results in more expressive learning, making training more reliable and easier to understand.

Once the model is fully trained, creating a new image follows the reverse process summarized above.
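The sampling loop can be sketched in one dimension. This is a toy under stated assumptions, not how a real image model is built: the "dataset" is a single number (TARGET), so the ideal noise predictor can be written as a formula (oracle_noise) instead of a trained U-Net, and the schedule values are chosen only to keep the example short. The update rule is the standard DDPM sampling step.

```python
import math
import random
from typing import List

random.seed(0)  # fixed seed so the run is repeatable

NUM_STEPS = 50
TARGET = 3.0  # hypothetical one-point "dataset": every example equals this value

# Linear beta schedule with toy values chosen for this sketch.
betas: List[float] = [1e-4 + (0.2 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
alphas = [1.0 - b for b in betas]
alpha_bars: List[float] = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)


def oracle_noise(x_t: float, t: int) -> float:
    """Exact noise predictor for the one-point dataset; a trained network approximates this."""
    return (x_t - math.sqrt(alpha_bars[t]) * TARGET) / math.sqrt(1.0 - alpha_bars[t])


x = random.gauss(0.0, 1.0)  # begin with pure Gaussian noise
for t in reversed(range(NUM_STEPS)):
    eps = oracle_noise(x, t)
    # Remove the predicted noise for this timestep (DDPM mean update).
    x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alphas[t])
    if t > 0:
        # Re-inject a small amount of fresh noise on every step except the last.
        x += math.sqrt(betas[t]) * random.gauss(0.0, 1.0)

print(round(x, 6))
```

Because the predictor is exact here, the loop ends at the target value up to floating-point error. With a real network, the prediction is approximate and the result is a new sample from the learned image distribution.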

Text Conditioning

Many open-source image generation models can guide the reverse process using text prompts, which we call text conditioning. By incorporating natural language, we get a scene that matches the prompt instead of random visuals.

The system uses a pre-trained text encoder (such as the CLIP text encoder; SDXL adds an OpenCLIP encoder, and Stable Diffusion 3 adds T5) to convert the prompt into a vector or sequence of embeddings. These embeddings are then fed into the diffusion U-Net through cross-attention, enabling the network to concentrate on relevant words and phrases as it denoises. During each step of the reverse process, the model references both the current noisy sample and the text embeddings, employing cross-attention to align emerging visual features with the prompt’s semantics.

Many implementations also use classifier-free guidance (CFG): the network blends unconditional and conditional predictions, with a guidance scale determining how closely the image follows the prompt. In latent-diffusion setups, all conditioning occurs in latent space, and a VAE decoder then converts the final latent back into pixels.
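The blending step in classifier-free guidance is small enough to show directly. The sketch below uses plain lists and made-up numbers in place of real noise predictions; the function name is ours, while the formula is the one commonly used in implementations.

```python
from typing import List


def classifier_free_guidance(
    eps_uncond: List[float], eps_cond: List[float], guidance_scale: float
) -> List[float]:
    """Blend unconditional and conditional noise predictions element by element."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Made-up predictions standing in for the network's two outputs.
eps_uncond = [0.10, -0.20, 0.05]
eps_cond = [0.30, -0.10, 0.00]

ignore_prompt = classifier_free_guidance(eps_uncond, eps_cond, 0.0)
plain_conditional = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
strong_guidance = classifier_free_guidance(eps_uncond, eps_cond, 7.5)

print(ignore_prompt)
print(plain_conditional)
print(strong_guidance)
```

A scale of 0 ignores the prompt, a scale of 1 gives the ordinary conditional prediction, and larger values push the result further in the direction the prompt suggests, usually at some cost to variety.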

Notable Open-Source Text-to-Image Models (2025)

Stable Diffusion v1.5 – The original Stable Diffusion (by CompVis/StabilityAI) is a latent diffusion text-to-image model capable of generating photorealistic images from text prompts.
Stable Diffusion v2.1 – A newer StabilityAI release, SD v2.1, is a refined latent diffusion model (768×768) that also creates and edits images from text.
Stable Diffusion 3 Medium (MMDiT) – A mid-sized “Stable Diffusion 3” model utilizing the new Multimodal Diffusion Transformer (MMDiT) architecture.
Stable Diffusion 3.5 Large (MMDiT) – A larger MMDiT version of Stable Diffusion 3, optimized for top quality. SD3.5 Large “offers improved performance in image quality, typography, complex prompt understanding, and resource efficiency.”
Stable Diffusion XL 1.0 (base) – The flagship high-capacity SDXL model. The SDXL 1.0 base model is a latent diffusion model using two large CLIP text encoders (ViT-G and ViT-L) to handle nuanced prompts.
SDXL-Lightning (ByteDance) – A research model by ByteDance that distills Stable Diffusion XL for speed. SDXL-Lightning “is a lightning-fast text-to-image generation model” that can produce 1024px images in only a few diffusion steps.
FLUX.1 (Black Forest Labs) – A modern open-weights rectified-flow transformer (≈12B params) for high-fidelity text-to-image. Strong prompt following and DiT-style efficiency.
Playground v2.5 (Playground AI) – An SDXL-style latent-diffusion base tuned for aesthetic 1024×1024 results and robust aspect ratios.
HunyuanImage-3.0 (Tencent) – A native multimodal open-weights system whose text-to-image module targets parity with leading closed models; active, fast-moving repo with inference code and weights.
PixArt-Σ (PixArt-alpha) – A Diffusion-Transformer (DiT) base that can generate up to 4K directly in a single sampling pass; an influential open alternative to UNet-based LDMs.

Each of the above models has openly available weights and remains widely used, so any of them can be a solid starting point for your work.

That’s all for the simple introduction to the Open‑Source Image Generation Models. If you like the article, don’t forget to share and comment.

14 Portfolio Projects That Demonstrate Real Business Value

Cornellius Yudha Wijaya — Tue, 28 Oct 2025 14:30:20 GMT

Image by Author | Ideogram.ai

We live in an era when data has become a commodity that every business wants to use. That’s why many companies are willing to pay a lot of money to hire the best data scientists.

With so much competition, the best way to stand out is to have a data science portfolio that addresses real business problems with measurable results.

Below are 14 real-world-inspired projects you can draw on. Each project shows the strategic problem, the approach, the measurable impact, and the deployment in production.

Curious about it? Let’s get into it.

Python 3.14: 12 Features You Can Use Today

Cornellius Yudha Wijaya — Wed, 22 Oct 2025 14:07:37 GMT

Image by Author | Ideogram.ai

Python 3.14 has now been released, bringing a mix of improvements to the language, its implementation, and the standard library. Many of the biggest changes sharpen the language’s tools, boost developer ergonomics, and open doors to new capabilities without forcing you to rewrite your code.

In this article, we highlight 12 new features and enhancements in Python 3.14 that are particularly useful for data scientists and Python developers, focusing on practical benefits in data manipulation, performance, and everyday development.

Each feature below is presented with a brief explanation of what it is, why it matters, and an example (where applicable) showing how you can start using it today.

Let’s get into it!

1. Colorful Interactive REPL

One of the first things you’ll notice in Python 3.14 is a friendlier interactive shell (REPL). The default Python REPL now highlights Python syntax in color, making code easier to read as you type. Keywords, built-ins, and other syntax elements are colored by default, improving the interactive coding experience. This enhancement helps you spot syntax errors or typos faster and provides a more intuitive, IDE-like feel when working in the terminal.

In addition to the REPL, several command-line interfaces in the standard library (such as unittest, argparse, json, and others) now support colored output. This means that running your tests or parsing arguments can produce color-coded text (for example, highlighting errors or important information) without any extra configuration. All these improvements contribute to a more pleasant and productive development workflow right out of the box.

2. More Helpful Error Messages

Python 3.14 continues the recent trend of improving error messages to be more descriptive and helpful. The interpreter can now often guess your mistakes and suggest fixes. For example, if you accidentally mistype a Python keyword, the error will include a suggestion:

whille True:
   pass
SyntaxError: invalid syntax. Did you mean 'while'?

In this case, Python noticed the misspelling and helpfully suggested the correct keyword. This saves developers time in tracking down simple typos. Similar improvements have been made for other common mistakes. For instance, using an elif after an else block now yields a clear error (“'elif' block follows an 'else' block”), and using incompatible prefixes on string literals (like ub'...') will tell you that those prefixes cannot be combined.

Error messages for runtime issues have also been polished. If you try to add an unhashable object to a set or use one as a dict key, the TypeError now states explicitly what went wrong (e.g., “cannot use 'dict' as a set element (unhashable type: 'dict')”). Overall, these clearer error messages guide you toward fixes faster, making debugging and development more efficient.
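You can see the runtime message on your own interpreter with a short snippet. The exact wording depends on the Python version, so the sketch prints whatever your interpreter reports instead of assuming the 3.14 text.

```python
try:
    seen = set()
    seen.add({"id": 1})  # dicts are mutable, so they cannot be hashed
except TypeError as exc:
    message = str(exc)
    print(message)
```

Older versions report only the unhashable type, while Python 3.14 adds the context described above.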

3. Safe Live Debugging (Attach to Running Processes)

Debugging long-running processes just got easier and safer. Python 3.14 introduces a zero-overhead debugging interface (PEP 768) that allows debuggers and profilers to attach to a running Python process without pausing or altering its execution. In practical terms, this means you can inspect and debug a live Python program (even in production) without needing to start it under a debugger from the beginning.

One direct benefit of this feature is that the built-in Python debugger pdb can now attach to an existing process. For example, you can attach to a process with ID 12345 by running:

python -m pdb -p 12345

This will connect a pdb session to the running program identified by that PID (process ID). Previously, this capability wasn’t available: we had to anticipate debugging needs by starting the program under pdb or rely on external tools. Now, Python 3.14 provides a safe hook for live debugging, so you can investigate issues on the fly. Under the hood, this is enabled by a new sys.remote_exec() function and a carefully designed attach protocol, but you don’t need to know those details to use it.

The key takeaway is that debugging and profiling in production or long-running jobs is much more feasible, which is a big win for reliability and developer ergonomics.

4. Template Strings (T-Strings) for Custom String Processing

Python 3.14 introduces template string literals, also known as t-strings, providing a safer and more flexible way to perform string interpolation. Syntactically, t-strings look just like f-strings except they use a t prefix instead of f. For example:

>>> name = "Alice"
>>> template = t"Hello, {name}!"
>>> type(template)
<class 'string.templatelib.Template'>
>>> list(template)
['Hello, ', Interpolation('Alice', 'name', None, ''), '!']

Unlike an f-string, which immediately produces a plain string, a t-string evaluates to a Template object (defined in the new string.templatelib module) that contains the static parts and the interpolated parts separately. In the example above, the template holds the literal text and an Interpolation object for the {name} placeholder, including both the value and the original expression. This separation allows you to manipulate or validate the interpolated parts before combining them into a final string.

Why is this useful? Template strings enable safer string processing patterns. You can build functions to escape or validate interpolated values (e.g. to prevent HTML or SQL injection) before rendering the final string. They open the door to custom domain-specific languages: for instance, you could implement an html() function that takes a Template and produces an HTML-safe string by escaping any dangerous characters in the interpolations. In short, t-strings give you the convenience of f-strings with an extra layer of control over how placeholders are handled. This is particularly useful in data science or web applications where you often need to dynamically generate strings but must be careful about sanitizing inputs.

5. Cleaner Exception Handling Syntax

Dealing with exceptions becomes a bit cleaner in Python 3.14. You no longer need to put multiple exception types in parentheses in an except clause when you’re not using an as alias. In previous versions, to catch multiple exception types, you would write:

try:
    do_something()
except (ValueError, TypeError):
    handle_error()

Now you can simply separate them with commas without parentheses:

try:
    do_something()
except ValueError, TypeError:
    handle_error()

This change (defined in PEP 758) makes the syntax for catching multiple exceptions more concise. It also applies to except* clauses (used for exception groups, for example in asynchronous tasks), where you can likewise omit the parentheses when not binding the exception object. While this is a small tweak, it improves code readability and is one less thing to remember when writing try/except blocks. It’s a straightforward quality-of-life improvement for developers.

(Note: If you do use as to name the exception, you still need parentheses around multiple exception types to avoid ambiguity.)

6. AsyncIO Task Introspection with asyncio ps and pstree

If you write or maintain asynchronous code, Python 3.14 brings a new tool to help debug and understand your async tasks. The asyncio module now has a command-line introspection interface that lets you inspect running asynchronous tasks in a live process. By running python -m asyncio ps PID (where PID is the process ID of a Python program using asyncio), you get a snapshot of all running tasks in that event loop. It lists each task, its name, its coroutine call stack, and which tasks (if any) are awaiting it. This is akin to a process listing (ps) but for asyncio tasks, helping you see what coroutines are active or stuck.

There’s also python -m asyncio pstree PID, which displays the tasks in a tree structure, showing parent-child relationships between tasks (e.g., which task spawned or is awaiting which). This is especially useful for visualizing complex async workflows or diagnosing deadlocks in async code. For example, if tasks are awaiting each other and form a cycle, the tool will detect it and report the cycle.

Why this matters: debugging async applications (like web servers, crawlers, or any I/O-heavy concurrent program) has historically been challenging. This new introspection capability lets you peek inside a running async event loop to troubleshoot performance issues or logical bugs without stopping the program. It’s a built-in way to monitor and debug asyncio, which will be valuable in real-world scenarios such as identifying which coroutine is blocking your application.
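
To try these commands, you need a running asyncio program to point them at. The following is a minimal sketch of such a target. The fetch coroutine, the task names, and the delay value are made up for illustration. The script itself runs on older Python versions, but the ps and pstree subcommands require Python 3.14.

```python
import asyncio
import os


async def fetch(delay: float) -> float:
    """Stand-in for real I/O: sleeps, then returns the delay."""
    await asyncio.sleep(delay)
    return delay


async def main(delay: float = 1.0) -> list:
    # Print the PID so that another terminal can run:
    #   python -m asyncio ps <PID>
    #   python -m asyncio pstree <PID>
    print(f"PID: {os.getpid()}")
    # Named tasks are easier to recognize in the introspection output.
    tasks = [
        asyncio.create_task(fetch(delay), name=f"fetch-{i}")
        for i in range(3)
    ]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # Raise the delay (for example to 60) to leave time to attach.
    asyncio.run(main(delay=1.0))
```

Giving each task an explicit name is optional, but it makes the listing much easier to read than the default auto-generated names.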

7. Deferred Evaluation of Annotations (Lazy Type Hints)

Type annotations in Python 3.14 are now evaluated lazily by default, as specified in PEP 649 and PEP 749. In practice, this means that annotations on functions, classes, and modules are no longer executed at definition time, but stored for later evaluation only when needed. The immediate benefit is performance: defining functions with annotations is faster and has no side effects (previously, if an annotation referred to a name that wasn’t defined yet, you had to quote it or import it early). Now, you can freely use forward references in annotations without using string literals.

For example, you can define a self-referential type or mutually referential classes like this:

# Before Python 3.14: forward references had to be in quotes
class Tree:
    def __init__(self, parent: 'Tree' = None):
        self.parent = parent

# In Python 3.14: no quotes needed for forward references
class Tree:
    def __init__(self, parent: Tree = None):
        self.parent = parent

In the Python 3.14 version, the annotation parent: Tree won’t cause a NameError even though the class Tree isn’t fully defined at that point. The annotation is stored in a deferred form and can be resolved later (for instance, by typing.get_type_hints() or by get_annotations() in the new annotationlib module). This deferred evaluation improves runtime performance by avoiding work at import time, and simplifies development because you no longer need to add import hacks or quotes for forward-declared types.

For data scientists and developers, this “lazy” annotation behavior means you can add type hints more freely, even in complex module setups or circular dependencies. It reduces the friction of using type hints in large projects and lays the groundwork for more powerful type introspection utilities.

8. Parallel Subinterpreters for True Concurrency

Python 3.14 adds standard library support for subinterpreters (PEP 734), enabling a new model of parallelism. Subinterpreters are isolated Python interpreters within the same process, which you can think of as lightweight processes that can run in parallel on multiple CPU cores, but without the overhead of launching separate OS processes. The new concurrent.interpreters module and a high-level API InterpreterPoolExecutor in concurrent.futures let you easily run tasks in parallel interpreters.

Why is this exciting? Subinterpreters offer true multi-core parallelism within a single process. Each interpreter has its own GIL and its own objects, and data is passed between interpreters explicitly. They are like threads in terms of efficiency, but unlike threads, they don’t share state by default, which avoids GIL contention and many concurrency headaches. In fact, you can think of multiple interpreters as having “the isolation of processes with the efficiency of threads.” For CPU-bound tasks, this can drastically improve performance by utilizing all cores without needing to spin up full separate processes for each task.

Using subinterpreters is straightforward for developers familiar with concurrent.futures. For example, you can use the new InterpreterPoolExecutor similarly to a ThreadPool or ProcessPool:

from concurrent.futures import InterpreterPoolExecutor

def compute_square(x):
    return x * x

with InterpreterPoolExecutor() as executor:
    results = list(executor.map(compute_square, range(5)))
    print(results)  # Output: [0, 1, 4, 9, 16]

Each worker in an InterpreterPoolExecutor runs its tasks in its own separate interpreter, so CPU-bound computations truly run in parallel across cores. The arguments and results are pickled under the hood (since subinterpreters don’t share objects), but subinterpreters start much faster and use less memory than spawning new processes. This feature will enable more scalable data processing and parallel algorithms in pure Python, without needing external libraries or leaving the comfort of the Python standard library.

(Keep in mind that some C extension modules may need updates to work in multiple interpreters, but all built-in modules have been made compatible. The community is actively improving support now that this feature is available.)

9. Free-Threaded Python (No GIL Mode)

Perhaps one of the most impactful changes in Python 3.14 is that a free-threaded (no-GIL) build of Python is now officially supported (PEP 703/779). This variant of the interpreter removes the Global Interpreter Lock, allowing truly parallel threads in the same process. In other words, CPU-bound Python code can potentially use multiple threads at the same time, accelerating workloads like numerical computations, data transformations, or any heavy processing that was limited by the GIL before.

In Python 3.13, an experimental no-GIL build was introduced, but it required opting in and was not officially supported. In 3.14, the no-GIL build continues to be an opt-in feature, but it is maintained as a fully supported part of CPython going forward. This means you can compile or install a no-GIL edition of Python 3.14 knowing that it will receive updates and won’t be dropped without warning. If you’re interested in trying it, you can install the free-threaded build and run your multi-threaded code on it to see how much it speeds up on multi-core machines.

It’s worth noting that the free-threaded build, in its current state, may run single-threaded code about 5-10% slower than the regular GIL build due to the overheads introduced by removing the GIL. However, for programs that can utilize multiple threads, the ability to run in parallel often more than makes up for this overhead. This is a huge step for Python in domains like scientific computing and data engineering, where multi-core utilization is key. With Python 3.14, we’re seeing the beginning of a no-GIL future: you can start experimenting with it today to speed up threaded workloads, without changing your Python code at all (just use the no-GIL build).
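
One way to check what the free-threaded build does for you is to time the same CPU-bound work with one thread and with several. The snippet below is a rough sketch, not a rigorous benchmark. The count_primes workload and the job sizes are arbitrary choices for illustration. It relies on sys._is_gil_enabled(), which exists from Python 3.13 onward, and falls back to assuming the GIL is on for older versions.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def gil_enabled() -> bool:
    """Return True when the GIL is active (assumed True before Python 3.13)."""
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


def count_primes(limit: int) -> int:
    """Deliberately CPU-bound work: naive count of primes below `limit`."""
    return sum(
        1
        for n in range(2, limit)
        if all(n % d for d in range(2, int(n ** 0.5) + 1))
    )


def timed_run(workers: int, jobs: int = 4, limit: int = 20_000):
    """Run `jobs` identical tasks on `workers` threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(count_primes, [limit] * jobs))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    print(f"GIL enabled: {gil_enabled()}")
    for workers in (1, 4):
        elapsed, _ = timed_run(workers)
        print(f"{workers} worker(s): {elapsed:.2f}s")
```

On a regular build, the two timings should come out roughly the same, because only one thread executes Python bytecode at a time. On a free-threaded build with several cores, the four-worker run should be noticeably faster.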

10. Experimental JIT Compiler in CPython

Python 3.14 takes a step towards boosting performance by including an experimental Just-In-Time (JIT) compiler in the official CPython distribution. In the Windows and macOS Python 3.14 installers, an optional JIT is now bundled (disabled by default). This JIT works by dynamically compiling portions of Python bytecode into machine code at runtime, aiming to accelerate execution of hot code paths. It complements the adaptive interpreter introduced in earlier versions by optimizing at a larger granularity – not just one bytecode at a time, but sequences of instructions.

To try out the JIT, you enable it with an environment variable. For example, running your program with PYTHON_JIT=1 in the environment will turn on the JIT compiler. When enabled, the JIT will monitor your code as it runs and compile parts of it to native code for speed. This can lead to speed-ups for long-running or compute-intensive applications, though because it’s experimental, results vary and not all workloads will see a benefit yet.

For developers, the message is that Python is getting faster, and you can opt into these improvements right away. If you have a performance-critical script, it may be worth benchmarking with the JIT enabled to see if it helps. As the JIT stabilizes in future releases, we can expect Python to require fewer hand-written C extensions or workarounds for speed. Python 3.14’s JIT is an early glimpse at these forthcoming gains in execution speed.
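
If you want to benchmark with and without the JIT, one simple approach is to run the same snippet in two child processes that differ only in the PYTHON_JIT environment variable. The sketch below assumes exactly that setup. The SNIPPET workload and the time_with_jit helper are hypothetical names made up for this example, and on a build that does not include the JIT, both runs should behave the same.

```python
import os
import subprocess
import sys

# Hypothetical hot loop; replace it with code that resembles your workload.
SNIPPET = """
import time
start = time.perf_counter()
total = 0
for i in range(2_000_000):
    total += i * i % 7
print(time.perf_counter() - start)
"""


def time_with_jit(flag: str, snippet: str = SNIPPET) -> float:
    """Run `snippet` in a child interpreter with PYTHON_JIT set to `flag`.

    The snippet must print a single number (its elapsed seconds) to stdout.
    """
    env = {**os.environ, "PYTHON_JIT": flag}
    completed = subprocess.run(
        [sys.executable, "-c", snippet],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return float(completed.stdout.strip())


if __name__ == "__main__":
    for flag in ("0", "1"):
        print(f"PYTHON_JIT={flag}: {time_with_jit(flag):.3f}s")
```

A single run is noisy, so repeat the measurement several times before drawing conclusions.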

11. Tail-Call Optimized Bytecode Interpreter

Another under-the-hood improvement in Python 3.14 is a new tail-calling interpreter implementation for CPython. This isn’t a new feature you use in your code, but rather a change in how the Python interpreter executes bytecode. Instead of using one giant C switch statement for the main loop, the new interpreter uses tail calls between tiny functions that implement each opcode. For certain compilers and platforms, this approach has yielded a 3-5% overall speedup on the Python benchmark suite.

While a few percent may not sound like much, it’s a free performance boost that applies to all Python code. Especially in data science or server applications, even single-digit percentage improvements can translate to meaningful time savings over large workloads. The tail-call interpreter is currently an opt-in build (it requires a recent compiler like Clang 19+ and enabling a compile-time flag), so average users won’t see it unless they build Python from source with those options. However, its inclusion signals ongoing efforts to speed up CPython. It also lays groundwork for future compatibility as compilers evolve (GCC is expected to support this technique soon).

In summary, Python 3.14’s tail-call interpreter is purely an internal optimization. It doesn’t change Python’s semantics or require any code changes, but it shows that the Python core devs are squeezing out performance wherever possible. Over time, such improvements accumulate, making Python a bit faster with each release.

12. Incremental Garbage Collection

Python has an automatic garbage collector for cleaning up unused objects, especially those involved in reference cycles. In Python 3.14, the cyclic garbage collector has been improved to run incrementally, rather than in one big stop-the-world sweep. The result is that garbage collection pause times are dramatically reduced by an order of magnitude or more for large heaps. In practical terms, if your program allocates and releases a lot of objects (common in data processing, simulations, or servers that handle many requests), you should experience shorter delays when the GC runs, leading to smoother performance.

Previously, the GC might occasionally introduce noticeable pauses if there were a huge number of objects to examine, because it would try to process a lot of them in one go. With incremental GC, the work is broken into smaller chunks interleaved with normal execution, so your program doesn’t have to stop for as long at once. This is especially beneficial for applications that require responsiveness or have real-time constraints – for example, a data pipeline that ingests data continuously will see more consistent throughput, and an interactive application will remain more responsive even under heavy memory load.

As a developer or data scientist, you don’t need to do anything to reap this benefit – it’s an automatic improvement. Your Python 3.14 programs will likely “feel” snappier under memory pressure. This change is another example of Python 3.14 refining existing machinery (in this case, memory management) to be more efficient and robust in real-world use.
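
If you want to see how long collections take in your own program, the gc module lets you register callbacks that fire at the start and end of every collection. The following is a minimal sketch built on gc.callbacks. The GCPauseRecorder class and the make_cycles helper are names invented for this example, and the measured numbers will differ between Python versions and machines.

```python
import gc
import time


class GCPauseRecorder:
    """Context manager that records the duration of each GC run in seconds."""

    def __init__(self):
        self.pauses = []
        self._start = None

    def __call__(self, phase, info):
        # The gc module calls this with phase "start" or "stop".
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.pauses.append(time.perf_counter() - self._start)
            self._start = None

    def __enter__(self):
        gc.callbacks.append(self)
        return self

    def __exit__(self, *exc):
        gc.callbacks.remove(self)


def make_cycles(count: int) -> None:
    """Create reference cycles that only the cyclic GC can reclaim."""
    for _ in range(count):
        a = {}
        b = {"other": a}
        a["other"] = b


if __name__ == "__main__":
    with GCPauseRecorder() as recorder:
        make_cycles(200_000)
        gc.collect()
    print(f"collections: {len(recorder.pauses)}")
    print(f"longest pause: {max(recorder.pauses) * 1000:.2f} ms")
```

Running the same script under two Python versions gives a rough feel for how the longest pause changes.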

Conclusion

Python 3.14 brings new polish with features that are useful for our everyday work. In this article, we have covered the following new features:

Colorized interactive REPL and colored stdlib CLIs
Clearer, suggestion-rich error messages
Safe live debugging: attach to running processes
Template strings (“t-strings”) for controlled interpolation
Cleaner except syntax for multiple exceptions (no parens)
AsyncIO introspection: python -m asyncio ps / pstree
Deferred evaluation of annotations (lazy type hints)
Subinterpreters + InterpreterPoolExecutor for true parallelism
Free-threaded (no-GIL) CPython build (opt-in)
Experimental JIT compiler in CPython (opt-in)
Tail-call-based bytecode interpreter (internal speedup)
Incremental garbage collection (shorter GC pauses)

I hope it has helped!

Like this article? Don’t forget to comment and share.

Building AI-Ready Data

Cornellius Yudha Wijaya — Fri, 17 Oct 2025 05:45:33 GMT

Image by Author | Ideogram.ai

We often think that AI breakthroughs mean advanced algorithms and complex model architectures. However, the real secret to AI success isn’t just the model; it’s the data behind it. In fact, advanced AI models are just the “tip of the iceberg,” and 90% of success lies in data foundations.

In other words, AI initiatives mainly depend on the quality and readiness of data. As one expert put it, “AI models are only as good as the data they’re trained on,” because bad data leads to unreliable predictions.

Without AI-ready data, even the best algorithms will stumble.

We can think of this as a pyramid of needs for AI: the base layers (data collection, cleaning, and organization) must be solid before the pinnacle, the AI model, can work at its best.

So, how can we build AI-ready data? We will explore that in the next sections.

The AI-Ready Data Tooling Example

Great models need significant inputs. Think in layers, pick one tool per layer to solve a real bottleneck, then expand as needs grow.

Here are a few tool examples you can reference:

Conclusion

AI success relies more on solid data foundations than on flashy models: without AI-ready data that is high-quality, well-governed, enriched with metadata, and structured for reuse, the algorithms cannot succeed.

Achieving this requires an iterative lifecycle: collect, clean, structure, annotate, and integrate data; run quality checks; and enforce governance and security, with optional semantic alignment.

Use one fit-for-purpose tool per layer to address bottlenecks, expanding as necessary. Investing in this foundation enhances model reliability, explainability, and scalability across the business.

I hope it has helped!

Like this article? Don’t forget to comment and share!

10 Lessons Learned from Building Predictive Models

Cornellius Yudha Wijaya — Mon, 13 Oct 2025 13:29:47 GMT

Image by Author | Ideogram.ai

Building predictive models is not just a technical or statistical task; it's an ongoing learning process that combines data engineering, business insight, and product thinking. Each project offers lessons that improve how you approach the next one.

In my experience leading end-to-end predictive modeling projects, I have noted 10 insights that go beyond algorithms and metrics. These lessons reflect both the analytical capability and the practical realities of deploying models that create measurable impact.

Curious about it? Let’s get into it!

Common Challenges in Operationalizing Models (and How to Overcome Them)

Cornellius Yudha Wijaya — Sun, 05 Oct 2025 15:00:55 GMT

Image by Author | Ideogram.ai

Bringing a predictive model from the controlled environment of a prototype into the world of production is rarely a smooth journey.

While building a model in a notebook may take days or weeks, operationalizing it often exposes deeper issues beyond data science. These challenges can stall business progress and cost the company opportunities it should have captured.

Many machine learning projects face common challenges that you should understand when deploying your model into production. That’s why this article will discuss typical challenges encountered in operationalizing models and the solutions to overcome them.

Curious about it? Let’s get into it!

Sponsor Section

Packt is currently giving away a FREE E-Book for your learning:

• Learn Python Programming
• Mathematics of Machine Learning
• Mastering Power BI

All bundled with a FREE newsletter. Don’t miss them here:

PACKT FREE E-Book

1. Data Pipeline and Quality Issues

The integrity of data pipelines is essential for the success of any predictive modelling system. In practice, many projects face performance issues after deployment, not because of flawed algorithms, but because the production data feeding those models differs from the data used during training.

For example, issues such as discrepancies in data structure, missing values, delayed updates, or unrecorded schema changes can lead to silent failures, which distort model outputs and undermine stakeholder trust.

To reduce these risks, here are a few things you can do:

Implement end-to-end data validation
Perform quality checks at each stage of the pipeline, from ingestion to transformation and storage, to verify completeness, consistency, and validity.
Use automated validation frameworks
Automated data validation frameworks, such as Great Expectations or TensorFlow Data Validation, can help detect anomalies before they affect production components.
Maintain data lineage and versioning
Conduct regular data flow audits and maintain version histories to trace the origin and evolution of training features.
Strengthen communication between teams
Foster collaboration between data engineering and data science teams so upstream changes, such as new collection methods or revised definitions, are quickly addressed.
Establish clear documentation and schema registries
Maintain centralized schema definitions and data documentation to ensure consistency among sources, transformations, and models.
Treat data as a managed asset
Manage data with the same discipline as software assets. Stable, well-governed data pipelines build the foundation for reliable and scalable predictive systems.

Ultimately, developing a quality data pipeline and resolving quality issues will establish a stable foundation on which predictive models can operate reliably and be scaled confidently.
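
To make the validation idea concrete, here is a minimal sketch of a stage-level check in plain Python. It does not use the API of Great Expectations or TensorFlow Data Validation. The schema, the column names, and the 5% missing-value threshold are all hypothetical choices for illustration.

```python
from datetime import date

# Hypothetical schema: column name -> expected Python type.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "signup_date": date,
    "monthly_spend": float,
}


def validate_rows(rows, schema, max_missing_ratio=0.05):
    """Return a list of human-readable problems found in `rows`.

    Checks for unexpected columns, wrong value types, and columns whose
    share of missing (None or absent) values exceeds `max_missing_ratio`.
    """
    problems = []
    missing = {col: 0 for col in schema}
    for i, row in enumerate(rows):
        extra = set(row) - set(schema)
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        for col, expected in schema.items():
            value = row.get(col)
            if value is None:
                missing[col] += 1
            elif not isinstance(value, expected):
                problems.append(
                    f"row {i}: {col} is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    for col, count in missing.items():
        if rows and count / len(rows) > max_missing_ratio:
            problems.append(f"{col}: {count}/{len(rows)} values missing")
    return problems


if __name__ == "__main__":
    batch = [
        {"customer_id": 1, "signup_date": date(2025, 1, 5), "monthly_spend": 42.0},
        {"customer_id": "2", "signup_date": date(2025, 2, 1), "monthly_spend": None},
    ]
    for problem in validate_rows(batch, EXPECTED_SCHEMA):
        print(problem)
```

Running a check like this after ingestion and again after transformation turns silent failures into explicit ones, because the pipeline can stop as soon as the list of problems is not empty.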

2. Reproducibility and Version Control

Reproducibility is a key principle in implementing predictive models. It guarantees that each step can be repeated with the same results when using the same inputs and environment.

This principle is often violated when experiments are conducted without standardized tracking of datasets, feature changes, hyperparameters, or library versions. These oversights hinder model validation and make it harder to identify the causes of performance differences between development and production environments.

To help mitigate this problem, you can use the following tips:

Standardize experiment tracking
To create traceable records, use structured logging of data sources, parameters, and model outcomes. To manage your experiments, you can use tools such as MLflow or Weights & Biases.
Version both code and data
Maintain all scripts and datasets under version control to ensure reproducibility of training conditions.
Common tools: Git for code, DVC or LakeFS for dataset versioning.
Ensure environmental consistency
Use containerization to guarantee that models execute within the same software environment across development and production. Docker is a common choice for this.
Adopt governance standards
Document experiment results, model versions, and approval processes. Assign clear ownership for maintaining reproducibility practices.

In the end, establishing reproducibility and version control is both a technical safeguard and a governance requirement. These practices strengthen transparency and accountability and help ensure that predictive systems remain reliable.
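
Even before adopting a dedicated tracking tool, you can capture the essentials of a run in a small manifest. The sketch below uses only the standard library and is not the MLflow or DVC API. The build_manifest function and the manifest fields are assumptions made for this example.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_path, params, packages=()):
    """Collect what is needed to rerun an experiment under the same conditions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Package is not installed here.
    return {
        "data_path": str(data_path),
        "data_sha256": file_sha256(data_path),
        "params": dict(params),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


def save_manifest(manifest, out_path):
    """Write the manifest as sorted, indented JSON so diffs stay readable."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

Storing this file next to the trained model makes it possible to tell later whether a performance difference comes from the data, the hyperparameters, or the library versions.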

3. Scalability and Performance Constraints

A predictive model that performs well in experimental settings may not sustain the same level of efficiency once deployed in production.

The shift from offline testing to real-time or large-scale settings often reveals hidden inefficiencies in computation, memory management, and data throughput. For example, models that perform within seconds on small samples during development can become problematic when required to process millions of data points within milliseconds.

To address these challenges, here are a few tips to follow:

Design for scalability from the outset
Anticipate production requirements early to avoid structural limitations that are difficult to resolve later.
Profile performance early
Use profiling tools to detect bottlenecks in training and inference before deployment.
Simplify complex models
Reduce computational overhead through pruning, quantization, or other optimization techniques without compromising accuracy.
Match infrastructure to the use case
Use distributed systems for real-time tasks and parallelized pipelines for batch processing.
Test under realistic conditions
Validate responsiveness and stability with production-scale data and workloads.
Monitor and optimize continuously
Track latency, throughput, and resource utilization to maintain consistent performance as data and traffic increase.

Achieving scalability is not only a matter of increasing computational power; it also means designing systems that balance all the essential components. It’s an important issue to consider every time we talk about production.
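
As an illustration of profiling early, the sketch below wraps a prediction call with cProfile from the standard library. The slow_feature and predict_batch functions are hypothetical stand-ins for a real feature pipeline and model, written to be slow on purpose.

```python
import cProfile
import io
import pstats


def slow_feature(values):
    """Hypothetical feature step with accidental quadratic cost."""
    return [sum(v ** 2 for v in values[: i + 1]) for i in range(len(values))]


def predict_batch(values):
    """Hypothetical inference step that depends on the slow feature."""
    features = slow_feature(values)
    return [feature > 1000 for feature in features]


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_call(predict_batch, list(range(2000)))
    print(report)
```

Sorting by cumulative time puts the functions that dominate the call at the top of the report, which is usually where optimization should start.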

4. Model Degradation and Concept Drift

Predictive performance can decline after deployment because the data-generating process changes over time.

Two patterns are most common:

Data drift occurs when the distribution of input features shifts compared with the training data, and
Concept drift occurs when the relationship between inputs and the target outcome changes.

Both effects diminish the validity of learned parameters and can result in unstable or biased decisions if not managed.

To mitigate them, here are a few tips you can follow:

Define reference baselines
Preserve training snapshots, feature statistics, and performance metrics for comparison.
Monitor continuously
Track input and prediction distributions, calibration, and task metrics.
You can use tools such as Evidently AI or Alibi Detect. You can also use major cloud monitors (e.g., SageMaker Model Monitor, Vertex AI Model Monitoring, Azure ML Data Drift).
Alert and diagnose
Establish thresholds, then localize issues to specific features, segments, and time windows.
Retrain and validate
Use recent data for periodic or event-driven retraining, and apply windowed training or incremental learning. Validate with backtesting and fresh holdouts.
Control deployment risk
Release updates through shadow, canary, or A/B testing; ensure a clear rollback plan.
Harden data pipelines
Enforce schema validation, maintain unit consistency, control categories, and ensure data freshness SLAs.
Document governance
Log drift events, criteria, model versions, approvals, and ownership for monitoring and response.

Do not overlook model degradation and concept drift if you want a reliable production system.
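
One common way to quantify data drift for a single numeric feature is the Population Stability Index (PSI). The sketch below is a plain-Python illustration and not the API of Evidently AI or Alibi Detect. The equal-width binning, the epsilon floor, and the 0.2 alert level are conventions assumed for this example, and you should tune them for your data.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bin edges come from the reference sample; current values outside that
    range are counted in the first or last bin.
    """
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # Avoid zero width for constant data.

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(count / len(values), eps) for count in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    training = [i / 100 for i in range(100)]
    production = [value + 0.5 for value in training]  # Simulated shift.
    score = psi(training, production)
    print(f"PSI = {score:.2f}")
    if score > 0.2:  # Rule-of-thumb alert level, assumed here.
        print("Feature distribution has shifted; investigate before trusting predictions.")
```

A score near zero means the two distributions look alike, while larger values indicate a stronger shift. Computing it per feature and per time window helps localize where the drift comes from.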

Like this article? Don’t forget to share and comment.

Non-Brand Data

What Real SQL Work Taught Me About Being a Data Scientist

I did not start by taking SQL seriously

Work forced the lesson

What real SQL work actually looked like

1. Ad-hoc reporting taught me that simple requests are rarely simple

2. Metric definition matters more than query complexity

3. Combining data sources is harder than it looks

4. Even Python-heavy data science often begins with SQL

What I value in SQL work now

What I would tell aspiring data scientists now

Best Stock Market data API in the AI Agent era

Alpha Vantage

Overview

What makes it valuable in the AI Agent era?

In short

Tradier

Overview

What makes it valuable in the AI Agent era?

In short

Xignite

Overview

What makes it valuable in the AI Agent era?

In short

EOD Historical Data

Overview

What makes it valuable in the AI Agent era?

In short

QuoteMedia

Overview

What makes it valuable in the AI Agent era?

In short

Conclusion

7 SQL Use Cases Every Data Professional Should Know

1. KPI reporting

2. Funnel analysis

3. Cohort retention analysis

4. Segmentation

5. Experiment analysis

6. Data quality and QA checks

7. Operational monitoring

The bigger point

Where to go next

Start here

Cohort Retention in SQL

Start with the question, not the query

The version we’re building here

Sample data

Template Pack Index (Paid)

NBD Reading Vault (Paid): Guided Paths + Mini-Projects

✨Subscriber Benefits

Non-Brand Data Subscriber Benefits

Full version

Free subscribers

Paid members

Founding members

One-time purchases (optional)

How to redeem your benefits

All subscribers

Paid members

Founding members

Creating a Daily Bulk Ingestion Pipeline for Historical Price Data and Fundamentals

Foundation

The Data Source

Project structure

Step-by-Step Walkthrough

Step 1: define dependencies and configuration

Step 2: Establish a single configuration contract

Step 3: Implement a Stable API client

Step 4: define the schema and write for the data storage

Step 5: Convert API responses into data rows

Step 6: Create runnable entry points

Step 7: The database generation

Step 8: Scheduling (local or GitHub Actions)

Running the scripts

1. Install dependencies

2. Configure .env

3. Seed symbols into the database

4. Backfill historical prices

5. Run the daily ingestion job

2. Configure `.env`

Step 2: Add a `.env` file for configuration

Step 3: Load settings once in `app/config.py`

Step 4: Build a small in `app/fmp_client.py`

5. Implement the full pipeline in `app/engine.py`

Step 6: Render the one-page brief in `app/render_markdown.py`