In this blog, we introduce our latest infrastructure innovation: comprehensive transaction indexing by blockchain address at scale, built on a hybrid storage architecture designed to enable efficient querying and pagination of transaction histories across multiple blockchain networks.
Building the foundation for comprehensive blockchain analytics at CoinDesk Data
Following our journey from optimizing block data retrieval with NFS to leveraging Azure's L-Series machines for efficient processing, and most recently introducing our Address Metadata Endpoint, we're excited to share the next major milestone in our infrastructure evolution: comprehensive transaction indexing by blockchain address at scale.
This advancement represents a fundamental shift in how we approach blockchain data, enabling developers to build everything from portfolio tracking tools to sophisticated blockchain analytics platforms. Today, we're diving deep into the technical architecture that makes it possible to efficiently query, paginate and analyze transaction histories for any address across multiple blockchains.
When it comes to handling vast amounts of data, the scale of blockchain networks presents unique hurdles. Traditional approaches either couldn't cope with the volume or came with prohibitive costs. We needed a solution that was both performant and economically viable for a data company operating at our scale.
After extensive evaluation of various database technologies—including BigQuery, Snowflake, ClickHouse, and specialized blockchain databases—PostgreSQL emerged as the clear winner. But not for the reasons you might expect.
The API Cost Reality: At our scale of millions of API calls per month, usage-based pricing becomes prohibitive. Google BigQuery charges ~$5 per TB queried. With our normalized datasets ranging from 254GB (Bitcoin) to 61TB (Solana), even simple address queries carry a meaningful per-query scan cost, and those cents per call compound quickly. At millions of monthly API calls, we'd be looking at $50,000-$500,000+ in query costs alone. Compare this to PostgreSQL infrastructure costs of $5,000-$15,000 monthly for equivalent performance.
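To make that arithmetic concrete, here is a back-of-envelope sketch. The per-query scan size is an illustrative assumption (it presumes partitioning and clustering already trim each address query far below a full table scan); the pricing and call volume are the figures above.

// Back-of-envelope BigQuery query cost (illustrative assumptions, not measured figures)
const usdPerTBScanned = 5;          // BigQuery on-demand pricing, ~$5 per TB
const gbScannedPerQuery = 10;       // assumed scan size after partitioning/clustering
const queriesPerMonth = 1_000_000;  // "millions of API calls per month"
const monthlyQueryCost = (gbScannedPerQuery / 1024) * usdPerTBScanned * queriesPerMonth;
console.log(`~$${Math.round(monthlyQueryCost).toLocaleString('en-US')} per month`); // ≈ $48,828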
Query Latency for Real-Time APIs: Our users expect sub-100ms API responses. BigQuery's cold start penalty means even simple queries take 2-10 seconds. Snowflake has similar latency issues for small, frequent queries. PostgreSQL with proper indexing consistently delivers 10-50ms response times for our specific query patterns—transaction lookups by address with pagination.
Write Performance for Blockchain Ingestion: Blockchain data arrives in real-time bursts. New blocks every 12 seconds (Ethereum) or 400ms (Solana) require sustained write performance. BigQuery has streaming insert limits of 100,000 rows/second and charges $0.01 per 200MB. PostgreSQL handles 1M+ sustained inserts/second on proper hardware without additional costs. For Solana's transaction volume, BigQuery's streaming costs alone would exceed $50,000 monthly.
Complex Join Performance: Our normalization strategy requires efficient multi-table joins. PostgreSQL's cost-based optimizer and B-tree indexes excel at the address_id → transaction_id → hash lookups that power our pagination. Analytical databases optimize for columnar scans, not the row-level lookups and joins that APIs require.
What PostgreSQL Loses: We're honest about the trade-offs. ClickHouse would crush us on analytical aggregations across full datasets. BigQuery's columnar compression would reduce our storage costs by 60-80%. For pure analytics workloads, we'd choose differently. But we're building APIs, not running occasional analytics queries.
The Technical Truth: PostgreSQL won because at millions of API calls with sub-100ms latency requirements, operational costs matter more than theoretical performance. The best database is the one you can afford to run at scale.
Our solution combines the best of both worlds: PostgreSQL for indexing and relationships, Cloudflare R2 for blob storage.
We store three normalized tables per blockchain:
-- Address normalization (saves 24-46% storage vs denormalized)
CREATE TABLE address_eth (
    id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    blockchain_address TEXT NOT NULL UNIQUE
);

-- Transaction references with auto-incrementing IDs
CREATE TABLE transaction_eth (
    id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    hash TEXT NOT NULL UNIQUE
);

-- The junction table that makes everything work
CREATE TABLE address_transaction_eth (
    address_id BIGINT NOT NULL,
    transaction_id BIGINT NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    PRIMARY KEY (address_id, transaction_id),
    FOREIGN KEY (address_id) REFERENCES address_eth(id),
    FOREIGN KEY (transaction_id) REFERENCES transaction_eth(id)
);
This normalization delivers massive space savings. For Ethereum alone, we reduced storage from 825.87 GB (denormalized) to 449.12 GB (normalized)—a 45.6% reduction. When popular exchange addresses participate in millions of transactions, storing the full address string repeatedly becomes prohibitively expensive.
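Those savings come with a dependency on good indexes: the cursor-paginated, newest-first lookups described later hinge on a composite B-tree over the junction table. A minimal sketch of the kind of supporting DDL involved (an assumption for illustration, not our exact production schema; pool stands for a pg connection pool to the relevant shard):

// Illustrative supporting index (assumed DDL): the pagination pattern is
// "all transactions for one address, newest first", so the junction table
// wants a composite B-tree matching that access order.
await pool.query(`
  CREATE INDEX IF NOT EXISTS idx_address_transaction_eth_addr_time
    ON address_transaction_eth (address_id, timestamp DESC, transaction_id DESC)
`);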
While PostgreSQL handles the indexing and relationships efficiently, we store the full transaction details as compressed JSON blobs in Cloudflare R2. Each transaction gets a blob identified by its hash, for example 0x5c504ed432cb51138bcf09aa5e8a410dd4a1e204ef84bfed1be16dfba1b22060.json.gz.
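R2 speaks the S3 API, so the write path is straightforward. The sketch below shows the general shape under assumed names (the bucket, environment variables and putTransactionBlob helper are placeholders for illustration, not our production code):

const zlib = require('zlib');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

// R2 is addressed through its S3-compatible endpoint (account id and keys assumed to be in env vars)
const r2 = new S3Client({
  region: 'auto',
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY
  }
});

// Store one transaction as a gzipped JSON blob keyed by its hash
const putTransactionBlob = async (tx) => {
  await r2.send(new PutObjectCommand({
    Bucket: 'transactions',              // placeholder bucket name
    Key: `${tx.hash}.json.gz`,
    Body: zlib.gzipSync(JSON.stringify(tx)),
    ContentType: 'application/json',
    ContentEncoding: 'gzip'
  }));
};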
This hybrid approach keeps the PostgreSQL indexes compact and fast for lookups and pagination, while the bulky transaction payloads sit in inexpensive object storage that scales independently of the databases.
Different blockchains require different sharding strategies based on their scale:
Bitcoin (1 shard): At 254GB total, a single PostgreSQL instance handles the load comfortably. Bitcoin's lower transaction throughput doesn't require horizontal scaling yet.
Ethereum (5 shards): With ~90GB per shard, we achieve the sweet spot between performance and operational complexity. Address-based sharding using hash(address) % 5 ensures natural distribution.
Solana (10 shards): The massive 61.75TB dataset necessitates aggressive sharding. Each shard manages ~6.2TB, keeping individual PostgreSQL instances within manageable limits.
const crypto = require('crypto');

// Map an (asset, address) pair to the identifier of its shard's connection.
// Hashing the address keeps hot addresses evenly spread across shards.
const getShardConnection = (assetId, address) => {
  const shardCounts = { 1: 1, 2: 5, 3: 10 }; // BTC: 1, ETH: 5, SOL: 10
  const shardId = crypto.createHash('sha256')
    .update(address)
    .digest()
    .readUInt32BE(0) % shardCounts[assetId];
  return `PG_BC_ADDRESS_COLLECT_${(shardId + 1).toString().padStart(2, '0')}`;
};
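In the API layer, that identifier then resolves to a pooled connection. A minimal sketch of the resolution step, assuming each identifier names an environment variable holding the shard's connection string (the getShardPool helper and the env-var convention are assumptions for illustration):

const { Pool } = require('pg');

// One lazily created pg Pool per shard, keyed by its connection identifier
const pools = {};
const getShardPool = (assetId, address) => {
  const name = getShardConnection(assetId, address);
  if (!pools[name]) {
    pools[name] = new Pool({ connectionString: process.env[name] });
  }
  return pools[name];
};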
The real magic happens in our pagination strategy. Instead of traditional offset-based pagination (which becomes slow at large offsets), we use cursor-based pagination with our auto-incrementing transaction IDs:
// Get latest 100 transactions for an address
const getLatestTransactions = async (assetId, address, limit = 100) => {
  // pg Pool for this address's shard (resolution sketched above);
  // ${assetId} stands in for the per-chain table suffix (eth, btc, sol)
  const shard = getShardPool(assetId, address);
  return await shard.query(`
    SELECT t.id, t.hash, at.timestamp
    FROM address_transaction_${assetId} at
    JOIN address_${assetId} a ON at.address_id = a.id
    JOIN transaction_${assetId} t ON at.transaction_id = t.id
    WHERE a.blockchain_address = $1
    ORDER BY at.timestamp DESC, t.id DESC
    LIMIT $2
  `, [address, limit]);
};
// Get next page using cursor
const getTransactionsBefore = async (assetId, address, beforeId, limit = 100) => {
  const shard = getShardPool(assetId, address);
  return await shard.query(`
    SELECT t.id, t.hash, at.timestamp
    FROM address_transaction_${assetId} at
    JOIN address_${assetId} a ON at.address_id = a.id
    JOIN transaction_${assetId} t ON at.transaction_id = t.id
    WHERE a.blockchain_address = $1
      AND t.id < $2
    ORDER BY at.timestamp DESC, t.id DESC
    LIMIT $3
  `, [address, beforeId, limit]);
};
This approach maintains consistent performance regardless of how deep users paginate through transaction histories.
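For example, walking an address's entire history only ever touches the id cursor and a LIMIT. A short usage sketch built on the two functions above (the page size is arbitrary, and .rows assumes a standard pg result object):

// Stream an address's full history, newest to oldest, 100 rows at a time,
// using the id of the last row on each page as the cursor for the next one
const walkHistory = async (assetId, address, onPage) => {
  let result = await getLatestTransactions(assetId, address, 100);
  while (result.rows.length > 0) {
    await onPage(result.rows);
    const cursor = result.rows[result.rows.length - 1].id; // oldest id on this page
    result = await getTransactionsBefore(assetId, address, cursor, 100);
  }
};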
When users need full transaction details, our application layer federates data from PostgreSQL and Cloudflare R2:
const getTransactionDetails = async (transactionHashes) => {
  // Batch fetch from R2 for efficiency (fetchFromR2 is assumed to download
  // and gunzip the blob, returning the JSON string)
  const r2Promises = transactionHashes.map(hash =>
    fetchFromR2(`${hash}.json.gz`)
  );
  const transactionDetails = await Promise.all(r2Promises);
  return transactionDetails.map(detail => JSON.parse(detail));
};
const getAddressTransactionsWithDetails = async (assetId, address, limit) => {
  // Get transaction references from PostgreSQL (pg exposes them on result.rows)
  const { rows: transactions } = await getLatestTransactions(assetId, address, limit);
  // Fetch full details from R2
  const details = await getTransactionDetails(
    transactions.map(tx => tx.hash)
  );
  // Merge at application level
  return transactions.map((tx, index) => ({
    ...tx,
    details: details[index]
  }));
};
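From the caller's perspective it all collapses into a single call. An illustrative usage snippet (the asset id follows the BTC/ETH/SOL mapping above and the address is a placeholder):

// Latest 25 transactions for an Ethereum address: index from PostgreSQL, payloads from R2
(async () => {
  const txs = await getAddressTransactionsWithDetails(2, '0xExampleAddress', 25);
  console.log(txs[0].hash, txs[0].timestamp, txs[0].details);
})();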
This infrastructure provides three foundational APIs that enable comprehensive blockchain analysis. Together, they give developers everything needed to build products ranging from portfolio tracking tools to full-scale blockchain analytics platforms, and our production deployment already delivers the consistent, sub-100ms paginated address queries described above.
This transaction indexing infrastructure is already powering several internal tools at CoinDesk Data, and we're preparing to expose these capabilities through our public API in the coming months.
The combination of cost-efficient PostgreSQL sharding, intelligent data federation, and hybrid storage represents a scalable foundation that will grow with the blockchain ecosystem. As new chains emerge and transaction volumes continue to explode, our architecture can adapt without fundamental changes.
In our next post, we'll explore how we're using this transaction indexing infrastructure to power real-time portfolio analytics and cross-chain analysis tools. The ability to efficiently query transaction histories at scale opens up entirely new possibilities for blockchain data products.