Use a hash to ensure the integrity of an arbitrarily large dataset that may not fit directly on the blockchain.
The integrity of a large datum, a large collection of data (that may not fit into a blockchain transaction), or dynamic data needs to be preserved.
The blockchain, due to its full replication across all participants of the blockchain network, has limited storage capacity. Storing large volumes of data within a transaction may be impossible due to the limited size of transactions and blocks. For example, Ethereum has a block gas limit that bounds the number, computational complexity, and data size of the transactions included in a block. Throughput may also be limited. Yet data cannot take advantage of the blockchain's immutability or integrity guarantees without being stored on the blockchain. How can the integrity of a large set of data or of dynamic data be preserved?
- Integrity – Data integrity should be preserved. If changes are allowed, the integrity of data updates should also be preserved.
- Scalability – A blockchain’s throughput and its data-carrying capacity are constrained by a set of factors such as transaction size, block size, and inter-block time. Utilising blockchain to record every data change may result in the accumulation of transactions in the transaction pool.
- Cost – In a permissionless blockchain network, transaction fees need to be paid. Thus, frequently storing data and accumulating large datasets on a blockchain is expensive. Even in a permissioned blockchain, each full node maintains a replica of all historical transactions, which increases the cost of physical storage. While storing data in a smart contract makes it easier to manipulate, it can be less flexible due to the constraints that smart contract languages may place on value types and lengths.
- Size – There are limits on transaction size or block size. For example, on the Bitcoin blockchain, a default Bitcoin client only relayed OP_RETURN transactions up to 80 bytes, which was reduced to 40 bytes in February 2014. Ethereum has a block gas limit that caps the total amount of gas that all transactions in a block are allowed to use.
Instead of storing the data itself, we could store a concise representation of the data on-chain to protect its integrity. For example, we can generate an effectively unique digest/fingerprint of a given data item by hashing it with a cryptographic hash function such as SHA-256. A hash function is a one-way function that is easy to compute but hard to invert given the output of a random input. Cryptographic hash functions also have the property that even a minor change in the input data (e.g., a single bit) results in a significant change in the resulting hash. The hash value can then be stored on-chain using a transaction. This alone results in a substantial reduction in the data stored on the blockchain, as the hash value is much smaller than the data, e.g., 1 KB of data vs. a 256-bit hash value. When a user is presented with raw data for future validation (e.g., during auditing), they can generate the hash from the raw data and then compare it with the hash stored on the blockchain. Thus, storing the hash on-chain enables integrity validation.
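The hash-then-compare step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sha256_hex` and `matches_anchor` and the sample record are hypothetical, and the transaction that would carry the digest on-chain is deliberately left out because it depends on the blockchain platform in use.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_anchor(data: bytes, anchored_hex: str) -> bool:
    """Recompute the digest of `data` and compare it with the anchored value."""
    return sha256_hex(data) == anchored_hex.lower()


# A 1 KB record kept off-chain (illustrative content, not from the source).
record = b"x" * 1024

# Only this 32-byte (256-bit) digest would be written on-chain.
anchored = sha256_hex(record)

# Later, an auditor receives a copy of the data and checks it.
tampered = record[:-1] + b"y"  # same length, last byte changed

print(len(record), "bytes off-chain,", len(anchored) // 2, "bytes on-chain")
print("original verifies:", matches_anchor(record, anchored))
print("tampered verifies:", matches_anchor(tampered, anchored))
```

The comparison only detects a change; it does not reveal what changed or restore the original data.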
If the integrity of a collection of data is to be tracked, we could calculate the Merkle tree hash for the entire dataset. Then the Merkle root can be stored on the blockchain. If a particular piece of data changes frequently and the integrity of each update needs to be tracked over time, we could consider each value to be a separate data point and calculate the Merkle tree hash in the same way as for a dataset with multiple data items. Then, we could periodically record the hash of the off-chain data on the blockchain. In either case, the resulting Merkle tree root is substantially smaller than the data itself. Also, the number of transactions required to record the hash on-chain is significantly reduced, as only the Merkle tree root is stored for a large number of data points. This process is referred to as anchoring off-chain data to the blockchain.
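As a rough sketch of how a single root can stand in for many data points, the following Python computes a Merkle root over a list of byte strings. Several details are assumptions made for illustration rather than anything prescribed by the pattern: SHA-256 as the hash, the `0x00`/`0x01` prefixes that keep leaf and inner-node hashes distinct, and the rule that an unpaired node is carried up to the next level unchanged. Real systems pick their own conventions, and a verifier must use the same ones.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(items: list[bytes]) -> bytes:
    """Return a 32-byte Merkle root over `items` (order matters)."""
    if not items:
        raise ValueError("at least one data item is required")
    # Hash every item into a leaf; the 0x00 prefix marks leaf hashes.
    level = [_sha256(b"\x00" + item) for item in items]
    while len(level) > 1:
        next_level = []
        # Combine neighbours pairwise; the 0x01 prefix marks inner nodes.
        for i in range(0, len(level) - 1, 2):
            next_level.append(_sha256(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            # Assumed convention: an unpaired node moves up unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0]


# Successive values of one frequently changing datum, treated as data points.
updates = [b"balance=10", b"balance=12", b"balance=7"]
root = merkle_root(updates)
print("root to anchor:", root.hex())

# Altering any recorded value yields a different root.
forged = [b"balance=10", b"balance=99", b"balance=7"]
print("roots differ:", merkle_root(forged) != root)
```

Under these assumptions, one 32-byte root covers the whole batch, so a single transaction per anchoring period is enough regardless of how many updates occurred in between.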
- Integrity – As the data are anchored on to the blockchain in the form of a hash, the hash value can be compared against the off-chain data to verify their integrity.
- Cost – Anchoring reduces the cost of applying blockchain in terms of transaction fees (in public blockchains) and physical storage, as fewer transactions are sent to the blockchain.
- Scalability – Anchoring keeps complex and tedious business processes off-chain, while only infrequently recording the hash value on-chain. Thus, it enables blockchain-based applications to work within the scalability limits of a blockchain platform.
- Privacy – Blockchain transactions are immutable and can be viewed by all participants. Thus, storing hash values rather than the original information can preserve data privacy.
- Opacity – Hash values are neither human-readable nor reversible into the original files, which means that using anchoring may affect the transparency and auditability of on-chain data.
- Integrity – The raw data is stored off-chain, where the data store might not offer the same level of security/integrity guarantees as the blockchain. The raw data may be changed without authorisation. This change can be detected thanks to the hash of the original data stored on the blockchain. However, without additional measures, it will be possible neither to recover the original data nor to prevent the change from happening in the first place.
- Data loss – Because the raw data is stored off-chain, it may be deleted or lost. Only its hash value remains permanently on the blockchain.
- Data sharing – The on-chain data can be shared among transacting parties through the blockchain. Conversely, extra communication mechanisms and storage platforms are required for off-chain data sharing.
- In the selective content generation pattern, off-chain credential contents need to be hashed and stored on-chain to preserve integrity.
- The legal and smart contract pair pattern binds a legal agreement stored off-chain to the corresponding smart contract that codifies the legal agreement.
- Blockstack allows entities to register off-chain decentralised identifiers. To prove the existence of these off-chain identifiers, the system collects hashes of corresponding files and writes the hash values to the blockchain.
- Chainpoint is an open standard for creating a timestamp proof of any data, file, or process by generating a Merkle tree, and publishing the root of this tree to the Bitcoin blockchain.
- The Proof-of-Existence (POEX.IO) service allows entering a SHA-256 cryptographic hash of a document into the Bitcoin blockchain as a proof-of-existence of the document at a certain time. The hash value guarantees the data integrity of the document. Hashingdna is a similar service that supports both Bitcoin and Stellar.
- Chainy is a smart contract running on the Ethereum blockchain that stores a short link to an off-chain file and its corresponding hash value in one place.
- Researchers from Data61 and Laava ID Pty. Ltd. designed a scalable platform architecture for Laava's multitenant blockchain-based systems, in which every tenant has an individual permissioned blockchain to maintain their own data, while all tenant chains are anchored into the main chain periodically.