In this blog, we introduce our latest infrastructure innovation: comprehensive transaction indexing by blockchain address at scale, built on a hybrid storage architecture designed to enable efficient querying and pagination of transaction histories across multiple blockchain networks.
Building the foundation for comprehensive blockchain analytics at CoinDesk Data
Following our journey from optimizing block data retrieval with NFS to leveraging Azure's L-Series machines for efficient processing, and most recently introducing our Address Metadata Endpoint, we're excited to share the next major milestone in our infrastructure evolution: comprehensive transaction indexing by blockchain address at scale.
This advancement represents a fundamental shift in how we approach blockchain data, enabling developers to build everything from portfolio tracking tools to sophisticated blockchain analytics platforms. Today, we're diving deep into the technical architecture that makes it possible to efficiently query, paginate and analyze transaction histories for any address across multiple blockchains.
When it comes to handling vast amounts of data, the scale of blockchain networks presents unique hurdles. Traditional approaches either couldn't cope with the volume or came with prohibitive costs. We needed a solution that was both performant and economically viable for a data company operating at our scale.
After extensive evaluation of various database technologies—including BigQuery, Snowflake, ClickHouse, and specialized blockchain databases—PostgreSQL emerged as the clear winner. But not for the reasons you might expect.
The API Cost Reality: At our scale of millions of API calls per month, usage-based pricing becomes prohibitive. Google BigQuery charges ~$5 per TB queried. With our normalized datasets ranging from 254GB (Bitcoin) to 61TB (Solana), even simple address queries carry a meaningful per-query scan cost, and those cents per call compound quickly. At millions of monthly API calls, we'd be looking at $50,000-$500,000+ in query costs alone. Compare this to PostgreSQL infrastructure costs of $5,000-$15,000 monthly for equivalent performance.
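To make that arithmetic concrete, here is a back-of-envelope sketch. The per-query scan size is an illustrative assumption (it presumes partitioning and clustering already trim each address query far below a full table scan); the pricing and call volume are the figures above.

// Back-of-envelope BigQuery query cost (illustrative assumptions, not measured figures)
const usdPerTBScanned = 5;          // BigQuery on-demand pricing, ~$5 per TB
const gbScannedPerQuery = 10;       // assumed scan size after partitioning/clustering
const queriesPerMonth = 1_000_000;  // "millions of API calls per month"
const monthlyQueryCost = (gbScannedPerQuery / 1024) * usdPerTBScanned * queriesPerMonth;
console.log(`~$${Math.round(monthlyQueryCost).toLocaleString('en-US')} per month`); // ≈ $48,828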
Query Latency for Real-Time APIs: Our users expect sub-100ms API responses. BigQuery's cold start penalty means even simple queries take 2-10 seconds. Snowflake has similar latency issues for small, frequent queries. PostgreSQL with proper indexing consistently delivers 10-50ms response times for our specific query patterns—transaction lookups by address with pagination.
Write Performance for Blockchain Ingestion: Blockchain data arrives in real-time bursts. New blocks every 12 seconds (Ethereum) or 400ms (Solana) require sustained write performance. BigQuery has streaming insert limits of 100,000 rows/second and charges $0.01 per 200MB. PostgreSQL handles 1M+ sustained inserts/second on proper hardware without additional costs. For Solana's transaction volume, BigQuery's streaming costs alone would exceed $50,000 monthly.
Complex Join Performance: Our normalization strategy requires efficient multi-table joins. PostgreSQL's cost-based optimizer and B-tree indexes excel at the address_id → transaction_id → hash lookups that power our pagination. Analytical databases optimize for columnar scans, not the row-level lookups and joins that APIs require.
What PostgreSQL Loses: We're honest about the trade-offs. ClickHouse would crush us on analytical aggregations across full datasets. BigQuery's columnar compression would reduce our storage costs by 60-80%. For pure analytics workloads, we'd choose differently. But we're building APIs, not running occasional analytics queries.
The Technical Truth: PostgreSQL won because at millions of API calls with sub-100ms latency requirements, operational costs matter more than theoretical performance. The best database is the one you can afford to run at scale.
Our solution combines the best of both worlds: PostgreSQL for indexing and relationships, Cloudflare R2 for blob storage.
We store three normalized tables per blockchain:
-- Address normalization (saves 24-46% storage vs denormalized)
CREATE TABLE address_eth (
    id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    blockchain_address TEXT NOT NULL UNIQUE
);

-- Transaction references with auto-incrementing IDs
CREATE TABLE transaction_eth (
    id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    hash TEXT NOT NULL UNIQUE
);

-- The junction table that makes everything work
CREATE TABLE address_transaction_eth (
    address_id BIGINT NOT NULL,
    transaction_id BIGINT NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    PRIMARY KEY (address_id, transaction_id),
    FOREIGN KEY (address_id) REFERENCES address_eth(id),
    FOREIGN KEY (transaction_id) REFERENCES transaction_eth(id)
);
This normalization delivers massive space savings. For Ethereum alone, we reduced storage from 825.87 GB (denormalized) to 449.12 GB (normalized)—a 45.6% reduction. When popular exchange addresses participate in millions of transactions, storing the full address string repeatedly becomes prohibitively expensive.
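Those savings come with a dependency on good indexes: the cursor-paginated, newest-first lookups described later hinge on a composite B-tree over the junction table. A minimal sketch of the kind of supporting DDL involved (an assumption for illustration, not our exact production schema; pool stands for a pg connection pool to the relevant shard):

// Illustrative supporting index (assumed DDL): the pagination pattern is
// "all transactions for one address, newest first", so the junction table
// wants a composite B-tree matching that access order.
await pool.query(`
  CREATE INDEX IF NOT EXISTS idx_address_transaction_eth_addr_time
    ON address_transaction_eth (address_id, timestamp DESC, transaction_id DESC)
`);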
While PostgreSQL handles the indexing and relationships efficiently, we store the full transaction details as compressed JSON blobs in Cloudflare R2. Each transaction gets a blob identified by its hash, for example 0x5c504ed432cb51138bcf09aa5e8a410dd4a1e204ef84bfed1be16dfba1b22060.json.gz.
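R2 speaks the S3 API, so the write path is straightforward. The sketch below shows the general shape under assumed names (the bucket, environment variables and putTransactionBlob helper are placeholders for illustration, not our production code):

const zlib = require('zlib');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

// R2 is addressed through its S3-compatible endpoint (account id and keys assumed to be in env vars)
const r2 = new S3Client({
  region: 'auto',
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY
  }
});

// Store one transaction as a gzipped JSON blob keyed by its hash
const putTransactionBlob = async (tx) => {
  await r2.send(new PutObjectCommand({
    Bucket: 'transactions',              // placeholder bucket name
    Key: `${tx.hash}.json.gz`,
    Body: zlib.gzipSync(JSON.stringify(tx)),
    ContentType: 'application/json',
    ContentEncoding: 'gzip'
  }));
};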
This hybrid approach keeps the PostgreSQL indexes compact and fast for lookups and pagination, while the bulky transaction payloads sit in inexpensive object storage that scales independently of the databases.
Different blockchains require different sharding strategies based on their scale:
Bitcoin (1 shard): At 254GB total, a single PostgreSQL instance handles the load comfortably. Bitcoin's lower transaction throughput doesn't require horizontal scaling yet.
Ethereum (5 shards): With ~90GB per shard, we achieve the sweet spot between performance and operational complexity. Address-based sharding using hash(address) % 5 ensures natural distribution.
Solana (10 shards): The massive 61.75TB dataset necessitates aggressive sharding. Each shard manages ~6.2TB, keeping individual PostgreSQL instances within manageable limits.
const crypto = require('crypto');

// Map an (asset, address) pair to the identifier of its shard's connection.
// Hashing the address keeps hot addresses evenly spread across shards.
const getShardConnection = (assetId, address) => {
  const shardCounts = { 1: 1, 2: 5, 3: 10 }; // BTC: 1, ETH: 5, SOL: 10
  const shardId = crypto.createHash('sha256')
    .update(address)
    .digest()
    .readUInt32BE(0) % shardCounts[assetId];
  return `PG_BC_ADDRESS_COLLECT_${(shardId + 1).toString().padStart(2, '0')}`;
};
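In the API layer, that identifier then resolves to a pooled connection. A minimal sketch of the resolution step, assuming each identifier names an environment variable holding the shard's connection string (the getShardPool helper and the env-var convention are assumptions for illustration):

const { Pool } = require('pg');

// One lazily created pg Pool per shard, keyed by its connection identifier
const pools = {};
const getShardPool = (assetId, address) => {
  const name = getShardConnection(assetId, address);
  if (!pools[name]) {
    pools[name] = new Pool({ connectionString: process.env[name] });
  }
  return pools[name];
};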
The real magic happens in our pagination strategy. Instead of traditional offset-based pagination (which becomes slow at large offsets), we use cursor-based pagination with our auto-incrementing transaction IDs:
// Get latest 100 transactions for an address
const getLatestTransactions = async (assetId, address, limit = 100) => {
  // pg Pool for this address's shard (resolution sketched above);
  // ${assetId} stands in for the per-chain table suffix (eth, btc, sol)
  const shard = getShardPool(assetId, address);
  return await shard.query(`
    SELECT t.id, t.hash, at.timestamp
    FROM address_transaction_${assetId} at
    JOIN address_${assetId} a ON at.address_id = a.id
    JOIN transaction_${assetId} t ON at.transaction_id = t.id
    WHERE a.blockchain_address = $1
    ORDER BY at.timestamp DESC, t.id DESC
    LIMIT $2
  `, [address, limit]);
};
// Get next page using cursor
const getTransactionsBefore = async (assetId, address, beforeId, limit = 100) => {
  const shard = getShardPool(assetId, address);
  return await shard.query(`
    SELECT t.id, t.hash, at.timestamp
    FROM address_transaction_${assetId} at
    JOIN address_${assetId} a ON at.address_id = a.id
    JOIN transaction_${assetId} t ON at.transaction_id = t.id
    WHERE a.blockchain_address = $1
      AND t.id < $2
    ORDER BY at.timestamp DESC, t.id DESC
    LIMIT $3
  `, [address, beforeId, limit]);
};
This approach maintains consistent performance regardless of how deep users paginate through transaction histories.
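For example, walking an address's entire history only ever touches the id cursor and a LIMIT. A short usage sketch built on the two functions above (the page size is arbitrary, and .rows assumes a standard pg result object):

// Stream an address's full history, newest to oldest, 100 rows at a time,
// using the id of the last row on each page as the cursor for the next one
const walkHistory = async (assetId, address, onPage) => {
  let result = await getLatestTransactions(assetId, address, 100);
  while (result.rows.length > 0) {
    await onPage(result.rows);
    const cursor = result.rows[result.rows.length - 1].id; // oldest id on this page
    result = await getTransactionsBefore(assetId, address, cursor, 100);
  }
};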
When users need full transaction details, our application layer federates data from PostgreSQL and Cloudflare R2:
const getTransactionDetails = async (transactionHashes) => {
  // Batch fetch from R2 for efficiency (fetchFromR2 is assumed to download
  // and gunzip the blob, returning the JSON string)
  const r2Promises = transactionHashes.map(hash =>
    fetchFromR2(`${hash}.json.gz`)
  );
  const transactionDetails = await Promise.all(r2Promises);
  return transactionDetails.map(detail => JSON.parse(detail));
};
const getAddressTransactionsWithDetails = async (assetId, address, limit) => {
  // Get transaction references from PostgreSQL (pg exposes them on result.rows)
  const { rows: transactions } = await getLatestTransactions(assetId, address, limit);
  // Fetch full details from R2
  const details = await getTransactionDetails(
    transactions.map(tx => tx.hash)
  );
  // Merge at application level
  return transactions.map((tx, index) => ({
    ...tx,
    details: details[index]
  }));
};
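From the caller's perspective it all collapses into a single call. An illustrative usage snippet (the asset id follows the BTC/ETH/SOL mapping above and the address is a placeholder):

// Latest 25 transactions for an Ethereum address: index from PostgreSQL, payloads from R2
(async () => {
  const txs = await getAddressTransactionsWithDetails(2, '0xExampleAddress', 25);
  console.log(txs[0].hash, txs[0].timestamp, txs[0].details);
})();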
This infrastructure provides three foundational APIs that enable comprehensive blockchain analysis. Together, they give developers everything needed to build products ranging from portfolio tracking tools to full-scale blockchain analytics platforms, and our production deployment already delivers the consistent, sub-100ms paginated address queries described above.
This transaction indexing infrastructure is already powering several internal tools at CoinDesk Data, and we're preparing to expose these capabilities through our public API in the coming months.
The combination of cost-efficient PostgreSQL sharding, intelligent data federation, and hybrid storage represents a scalable foundation that will grow with the blockchain ecosystem. As new chains emerge and transaction volumes continue to explode, our architecture can adapt without fundamental changes.
In our next post, we'll explore how we're using this transaction indexing infrastructure to power real-time portfolio analytics and cross-chain analysis tools. The ability to efficiently query transaction histories at scale opens up entirely new possibilities for blockchain data products.