Blockchains are terrible databases. That's not an insult — it's a design tradeoff. A blockchain is optimized for tamper-resistance, not retrieval. The data is organized by block and transaction, not by the questions anyone actually wants to ask.
If you want to know the token balance of every address that interacted with a specific smart contract over the last 90 days, you'd technically need to replay every block from deployment and track state changes manually. That's not a hypothetical — it's exactly the kind of query that makes raw blockchain data impractical for applications.
Blockchain indexers exist to solve this mismatch. They sit between the raw chain and the applications that need to read it, processing data into structured, queryable formats in real time.
A blockchain node stores data in a format designed for consensus and verification. Blocks contain ordered lists of transactions. Transactions contain encoded function calls and value transfers. To find anything specific — say, all NFT transfers from a given address, or the historical APY of a lending pool — you'd need to decode raw transaction calldata, parse the event logs, and aggregate across thousands of blocks.
This is possible but slow. For production applications — wallets, analytics dashboards, DeFi protocols — querying this way introduces latency that makes the product unusable. A DEX frontend can't wait 30 seconds for a page to load.
Indexers solve this by doing the hard work ahead of time. They continuously read new blocks, decode and parse the relevant events and state changes, and write them into a conventional database — typically PostgreSQL or a similar store — organized in a way that supports fast reads. When an application needs data, it queries the indexer, not the blockchain directly.
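The core loop can be sketched in a few lines. This is a simplified illustration over simulated blocks — in a real indexer the block source would be an RPC client following the chain head, and the in-memory dictionary would be a proper database:

```python
# Simulated chain: each block dict stands in for what an RPC client
# would return from a real node (assumption: simplified log tuples).
CHAIN = [
    {"number": 1, "logs": [("Transfer", "0xaaa", "0xbbb", 50)]},
    {"number": 2, "logs": []},
    {"number": 3, "logs": [("Transfer", "0xbbb", "0xccc", 20)]},
]

balances = {}  # the "database": precomputed, query-ready state

def process_block(block):
    """Decode the block's relevant events and update derived state."""
    for event, sender, recipient, value in block["logs"]:
        if event != "Transfer":
            continue
        balances[sender] = balances.get(sender, 0) - value
        balances[recipient] = balances.get(recipient, 0) + value

for block in CHAIN:  # in production, this loop follows new blocks forever
    process_block(block)

# An application reads the answer directly instead of replaying blocks:
print(balances["0xbbb"])  # 30
```

The point is the division of labor: all the expensive decoding happens once, at ingestion time, so every subsequent read is a cheap lookup.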
The basic architecture has four components.
Chain connection. The indexer connects to a blockchain node (or a node provider) via RPC and subscribes to new blocks. As each new block arrives, the indexer reads its contents.
Event parsing. Smart contracts emit events — structured logs that record significant state changes. An ERC-20 token transfer, for example, emits a Transfer(address from, address to, uint256 value) event. The indexer decodes these logs using the contract's ABI (the interface specification that defines what events and functions look like) and extracts the meaningful fields.
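Decoding follows directly from the ABI layout: indexed event parameters land in the log's topics array, while non-indexed parameters are ABI-encoded in the data field. A minimal sketch using a hand-constructed log entry (the addresses and value are illustrative, not real chain data):

```python
# keccak256("Transfer(address,address,uint256)") — the well-known topic
# hash that identifies ERC-20 Transfer events in topics[0].
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

# Hand-constructed log entry with illustrative values.
log = {
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,  # from: indexed, left-padded to 32 bytes
        "0x" + "00" * 12 + "22" * 20,  # to:   indexed, left-padded to 32 bytes
    ],
    "data": "0x" + format(10**18, "064x"),  # value: non-indexed, in data
}

def decode_transfer(log):
    if log["topics"][0] != TRANSFER_TOPIC:
        return None  # not a Transfer event
    return {
        "from": "0x" + log["topics"][1][-40:],  # last 20 bytes of the topic
        "to":   "0x" + log["topics"][2][-40:],
        "value": int(log["data"], 16),
    }

decoded = decode_transfer(log)
print(decoded["value"])  # 1000000000000000000 (one token at 18 decimals)
```

Production indexers generalize this: the ABI tells them the shape of every event a contract can emit, so the same decoding logic applies mechanically across contracts.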
State tracking. Some data can't be reconstructed from events alone — it requires calling view functions on the contract to read current state. Indexers may periodically snapshot this data and track changes over time.
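A periodic snapshot is the simplest form of this. The sketch below uses a stub in place of the RPC view call — `read_pool_rate` is a hypothetical stand-in for something like calling a lending pool's rate function at a given block:

```python
# Stand-in for a contract view call (assumption: a real indexer would
# make an RPC call against the contract at a specific block height).
def read_pool_rate(block_number):
    return 100 + block_number  # fake, deterministic "on-chain" value

SNAPSHOT_INTERVAL = 10  # snapshot every N blocks
snapshots = []          # (block, value) time series for historical queries

for block_number in range(0, 50):
    if block_number % SNAPSHOT_INTERVAL == 0:
        snapshots.append((block_number, read_pool_rate(block_number)))

print(snapshots)  # [(0, 100), (10, 110), (20, 120), (30, 130), (40, 140)]
```

The interval is a tradeoff: tighter snapshots give finer-grained history at the cost of more RPC calls and storage.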
Database write. Decoded, structured data gets written into a queryable database. The indexer maintains a mapping from on-chain activity to database records, which applications then query via a standard API — usually GraphQL or REST.
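At this stage the data looks like any conventional backend. The sketch below uses SQLite for self-containment where a production indexer would typically use PostgreSQL, and omits the GraphQL/REST layer that would sit on top:

```python
import sqlite3

# SQLite stands in for the production PostgreSQL instance.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE transfers (
        block     INTEGER NOT NULL,
        tx_hash   TEXT    NOT NULL,
        sender    TEXT    NOT NULL,
        recipient TEXT    NOT NULL,
        value     INTEGER NOT NULL
    )
""")
# Indexes on the columns applications filter by are what make reads fast.
db.execute("CREATE INDEX idx_recipient ON transfers(recipient)")

# Rows the decoding stage would have produced (illustrative values).
rows = [
    (1, "0xt1", "0xaaa", "0xbbb", 50),
    (2, "0xt2", "0xaaa", "0xccc", 30),
    (3, "0xt3", "0xbbb", "0xccc", 20),
]
db.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?, ?)", rows)

# A question that would require replaying blocks on-chain becomes one query:
received = db.execute(
    "SELECT SUM(value) FROM transfers WHERE recipient = ?", ("0xccc",)
).fetchone()[0]
print(received)  # 50
```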
When a reorg occurs (a chain reorganization where blocks are replaced), a well-built indexer needs to detect it, roll back any affected records, and reprocess the correct blocks. This is one of the harder engineering problems in indexing, and not all implementations handle it cleanly.
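One common detection strategy is a parent-hash check: if an incoming block's parent hash doesn't match the latest block the indexer has processed, the chain has diverged, and the indexer unwinds its state until it finds the fork point. A simplified sketch over simulated blocks — real implementations also account for finality depth and reprocess the replacement blocks:

```python
# Indexer state: block hashes already processed, plus records keyed by
# block number so they can be rolled back on a reorg.
processed = [("h1", 1), ("h2", 2), ("h3", 3)]  # (hash, number)
records = {1: ["tx-a"], 2: ["tx-b"], 3: ["tx-c"]}

def ingest(block):
    """Append a block, rolling back stale state if its parent isn't ours."""
    while processed and processed[-1][0] != block["parent"]:
        stale_hash, stale_number = processed.pop()  # unwind past the fork point
        records.pop(stale_number, None)             # roll back affected records
    processed.append((block["hash"], block["number"]))
    records[block["number"]] = block["txs"]

# A competing block 3' arrives whose parent is block 2, not our block 3:
ingest({"hash": "h3'", "parent": "h2", "number": 3, "txs": ["tx-c'"]})

print(records[3])  # ['tx-c'']  — the stale block's records were replaced
```

The `while` loop (rather than a single check) is what handles reorgs deeper than one block: the indexer keeps popping until it reaches a block both chains agree on.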
There are two broad approaches, and they carry meaningfully different trust assumptions.
Centralized indexers (Alchemy, Moralis, Etherscan's underlying data layer) run as managed services. A company operates the infrastructure, indexes the chains, and exposes the data via API. The advantage is reliability and speed. The tradeoff is trust: you're relying on the provider's infrastructure to return accurate, uncensored data. Most consumer applications use centralized indexers because they're easier to integrate and predictably performant.
Decentralized indexers — the paradigm The Graph Protocol is built around — distribute the indexing work across a network of independent operators. In The Graph's model, developers publish subgraphs (indexing schemas that define which contracts and events to track), and a decentralized network of indexers (node operators) processes queries, earning GRT tokens for the work. Curators signal which subgraphs are worth indexing by staking GRT on them. Delegators stake GRT toward indexers to share in rewards without running infrastructure.
The design goal is to make indexed blockchain data verifiable and censorship-resistant — any indexer returning falsified data can be challenged and slashed. In practice, most high-stakes production applications still use centralized providers for latency reasons, while The Graph's decentralized network is more commonly used for DeFi applications and community-built tooling where trust assumptions matter more.
The hard constraints here are mostly technical. Every indexer has some latency — data is never truly real-time, only near-real-time. The lag is typically seconds for finalized chains, longer for chains with probabilistic finality. Applications that need sub-second freshness can't rely on indexed data for the lowest-latency queries.
Chain reorganizations remain a structural challenge. Shallow reorgs (one or two blocks) happen regularly on most chains. Deeper reorgs are rare but possible. An indexer that doesn't handle reorgs gracefully will serve stale or incorrect data without signaling that it's doing so — which is worse than an outage.
Trust is also a constraint. For centralized providers, the guarantee that data is accurate is purely contractual and reputational. There's no cryptographic proof that a centralized indexer hasn't modified what it's serving. For most applications this is an acceptable tradeoff; for others, it isn't.
The Graph has been expanding its multi-chain support — the decentralized network now covers Ethereum, Arbitrum, Optimism, Polygon, and a growing list of chains. The question of whether decentralized indexing can compete with centralized providers on performance has been a sustained engineering challenge; improvements in query routing and indexer hardware have narrowed the gap, but it hasn't closed.
A newer development worth watching: streaming indexers — tools like Ponder and Envio — designed for real-time event processing rather than batch indexing. These are targeted at applications where latency matters more than historical depth. The ecosystem is fragmenting into specialized solutions rather than consolidating around a single approach.
There's also quiet movement toward on-chain data availability improvements (EIP-4844 and future Ethereum roadmap items) that may eventually reduce the data that needs to be indexed externally. That's a longer horizon.
Signs that decentralized indexing is gaining traction would include: continued growth in subgraph deployments on The Graph's decentralized network; a narrowing performance gap between decentralized and centralized indexers for standard query types; and major DeFi applications migrating from centralized providers to decentralized alternatives for trust-sensitive data.
Signs pointing the other way: a persistent latency disadvantage that keeps decentralized indexing from capturing production workloads beyond community tooling; a significant data falsification event at a major centralized provider, which would accelerate decentralized alternatives but also damage broader trust in indexed data; or Ethereum state changes that make current indexing approaches obsolete.
Now: Centralized indexers (Alchemy, Moralis, The Graph's hosted service) are the dominant infrastructure for most production applications. The tradeoffs are understood and the tools are mature.
Next: Decentralized indexing competition intensifies. Streaming indexers gain adoption for real-time use cases. Multi-chain indexing complexity grows as the number of active chains increases.
Later: If Ethereum's roadmap delivers meaningful on-chain data availability improvements, the indexing landscape could shift structurally. That's speculative at current progress rates.
This covers the indexing mechanism and the infrastructure layer. It doesn't address how to build a subgraph, how to choose between providers for a specific use case, or the economics of The Graph's GRT token. Those are separate questions.
The mechanism works as described. Whether a specific indexer provider is appropriate for a specific application depends on the application's trust requirements, latency needs, and chain coverage — factors outside the scope of this explanation.