Begin transmission…
Fingerprinting database backups with homomorphic hashes
Previously, we talked about the set semantics of homomorphic hashes.
Database backups are a core part of Basin's data pipeline. We are exploring various techniques to fingerprint the backups to ensure data integrity and tamper resistance throughout the pipeline. In this post, I would like to demonstrate an application that uses set-based hashes to fingerprint database backups.
For this post, we will assume we are collecting the backup files, such as Parquet exports, in a directory.
We want to have a fingerprint for this directory. The fingerprint should change whenever we add a new backup file or delete or modify an existing file. If there are no changes in the directory, the fingerprint should stay the same. In this way, we can keep track of state changes.
To generate the fingerprint, we will use the following method.
Read each file and generate a 32-byte hash digest with a performant one-way hash function like Blake3.
Initialize the HashSet (homomorphic hash with set semantics) with a default value. This is the fingerprint of the directory when there are no files present.
Insert the file hashes into the HashSet. Persist the HashSet for later use, and update it whenever you add new files or modify existing ones.
In the future, if you want to verify that the backups are not corrupt, you can hash the files again, insert those hashes into a new HashSet, and check whether the resulting fingerprint matches the one you persisted. We used Parquet files in a directory as the example here, but you could apply this technique to any collection of files. Here is an example in Rust that iterates through all the files in a directory and calculates its fingerprint.
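(A minimal sketch, assuming the blake3 and hex crates; for brevity it combines digests with byte-wise wrapping addition as a simplified, illustration-only stand-in for the homomorphic set hash discussed in the previous post.)
use std::fs;
use std::io;
use std::path::Path;

// Simplified stand-in for the set-semantics homomorphic hash: digests are
// combined with byte-wise wrapping addition, so the fingerprint is independent
// of insertion order. A production version would use the HashSet construction
// from the previous post rather than this toy combiner.
#[derive(Default)]
struct SetHash([u8; 32]);

impl SetHash {
    // Fold a 32-byte file digest into the running fingerprint
    fn insert(&mut self, digest: &[u8; 32]) {
        for (acc, byte) in self.0.iter_mut().zip(digest) {
            *acc = acc.wrapping_add(*byte);
        }
    }

    // Hex-encode the current fingerprint
    fn fingerprint(&self) -> String {
        hex::encode(self.0)
    }
}

// Hash every file in `dir` with Blake3 and insert the digests into the set hash
fn fingerprint_dir(dir: &Path) -> io::Result<SetHash> {
    let mut set_hash = SetHash::default();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() {
            let bytes = fs::read(&path)?;
            set_hash.insert(blake3::hash(&bytes).as_bytes());
        }
    }
    Ok(set_hash)
}

fn main() -> io::Result<()> {
    // Point this at the directory holding the Parquet backup files
    let fingerprint = fingerprint_dir(Path::new("./backups"))?.fingerprint();
    println!("directory fingerprint: {fingerprint}");
    Ok(())
}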
Ecosystem Spotlight
by Marla Natoli
We're big believers in the DePIN movement and love to hear about cool new concepts launching. This week we learned about Wingbits, the newest DePIN to join the Peaq ecosystem. Flight tracking has proven to be a profitable industry, yet it relies on data from volunteers; Wingbits is taking a new approach that rewards the contributors who collect and share valuable flight tracking data. As more and more data surfaces from DePINs, we're excited to see our data management solutions come to life to help make that data more accessible, collaborative, and secure. If you're working in the DePIN space and looking for flexible solutions to bring utility and/or transparency to your data, we'd love to hear from you.
Measuring product success
by Jim Kosem
There are many ways to measure how well your product is doing. You can do what I do and go talk to the people using it, or you can do what is a lot easier and more automatic: use data to find out. Yet this is one thing that designers often overlook or don't think about, as it is often seen as the job of the Business Analyst (BA) or Project Manager (PM) to sort out. But our job, especially for me in a place that works with data and making it safe, is to look at how we can mobilise our design to understand that data.
This typically means that part of designing the screens, views, and interactions is also pointing out which of those we can gather data on and how well we can do this. Does the button that I think should go at the top not only help the user get to where they're going quicker, but also form part of a larger way to see whether they complete the entire sequence of building a project quicker? What can we track, and how much can we gather whilst respecting privacy, to understand user interactions in the aggregate? In some ways this is a bit of a paradox for an encryption-focused company, but it's yet another good problem to have when thinking about data and how to treat it.
Studio 101 workshop—plus, using OpenAI to annotate YouTube videos
by Dan Buchholz
A few months ago, we opened up early access to the Tableland Studio. It's a web app and CLI that lets you create & manage teams, projects, and tables for your Tableland deployments. We continue to add more features and wanted to walk through an end-to-end flow for exactly how you can use the Studio: basic team/project setup in the UI, deploying tables, and uploading data/tables via the CLI.
Also, we updated the Studio CLI docs to the latest (here), including new commands for creating projects, deployments (listing or creating), and importing data from CSV files. The workshop touches on all of these.
You can check out the video below for the full workshop. In it, we walk through how to:
Sign in with your web3 wallet
Create personal or collaborative teams and projects
Set up table “blueprints” and deploy them to supported chains
Inspect teams, projects, and tables from the command line
Write CSV file data to Studio tables
Import existing tables (e.g., via smart contracts or SDK)
OpenAI Whisper & GPT-4
For YouTube videos, a nice UX feature is to have timestamps/chapters that link to specific video segments, as well as subtitles for accessibility support. So, I went down the path of creating a simple Python script that does exactly that! You can see the result in the video above.
With Whisper, the script generates a transcription of an mp4 video as a subtitles (.srt) file, which can be included during the video upload process. And with GPT-4, the script takes the outputs from Whisper and summarizes them with timestamps at certain points in the video.
You can check out the full source code here and use it with your own videos! Just note that chunking isn't implemented (yet!), so it expects the processed video-to-audio input for Whisper to be 25MB or less (e.g., a 20-minute video should be within this range), and GPT-4 has a context window length that could be exceeded. Here's a quick simplification/snippet of what it does:
Load environment variables for OpenAI API key and organization.
Load the OpenAI API client and models.
Use Whisper to transcribe the video.
Generate subtitles and pass them to GPT-4
from os import getenv
import whisper
from dotenv import find_dotenv, load_dotenv
from openai import OpenAI
# Load env vars for OpenAI API and organization
load_dotenv(find_dotenv())
openai_api_key = getenv("OPENAI_API_KEY")
openai_api_org = getenv("OPENAI_API_ORG")
# Load OpenAI API client
client = OpenAI(
organization=openai_api_org,
api_key=openai_api_key,
)
# Load whisper model
options = whisper.DecodingOptions(fp16=False, language="en")
whisper_model = whisper.load_model("base.en")
# Transcribe video
video_path = "/path/to/video.mp4"
result = whisper_model.transcribe(video_path, **options.__dict__, verbose=False)
# Helper to convert seconds to the SRT timestamp format (HH:MM:SS,mmm)
def format_time(seconds: float) -> str:
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
# Logic to take the result and format/generate .srt subtitles
segments = result["segments"]
srt_text = []
for i, segment in enumerate(segments):
    # SRT index starts from 1
    srt_index = i + 1
    srt_text.append(str(srt_index))
    # Formatting start and end times
    start_time = format_time(segment["start"])
    end_time = format_time(segment["end"])
    srt_text.append(f"{start_time} --> {end_time}")
    # Adding text
    srt_text.append(segment["text"].strip())
    # Add an empty line after each segment
    srt_text.append("")
segments = "\n".join(srt_text)
print(segments) # Show the subtitles
# Summarize segments for YouTube timestamps
response = client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": """You are a helpful assistant...""", # Note: a more prescriptive prompt should be used
},
{
"role": "user",
"content": segments,
},
],
temperature=0.7,
)
summary = response.choices[0].message.content.strip()
print(summary) # Show YouTube timestamps summary
Using JETI for inserting/reading IPFS CIDs with the SDK
by Dan Buchholz
We recently updated the @tableland/jeti package (here) with some refactoring and conformance with the latest Tableland SDK APIs. You can check out the latest usage in our docs: here.
JETI is designed to add extensibility to the @tableland/sdk and transform writes/reads when querying a Database. There are a few examples provided in the package:
Truncate values at 1024 bytes, which is the max length a single table cell can hold.
Upload/pin cell content of any length to IPFS with a local or remote pinning provider.
Encrypt/decrypt data in cells (note: more of a demonstration than production-grade).
Create a custom plugin to process data however you'd like.
The IPFS processor is extremely useful since it lets Tableland work with large media that can't be stored in a cell. To use it, you set up the processor, call it with string templating on writes to create/insert a CID, and resolve the result's CID from read queries. Here's a quick example of how it works:
import { pinToLocal, skip } from "@tableland/jeti";
import { Database } from "@tableland/sdk";
// Set up a `signer` and pass to the database
const db = new Database({ signer });
const tableName = "my_table_31337_2"; // Assuming the table was created beforehand
// Assuming you're running a local IPFS node, such as with the IPFS Desktop app
const localPinner = pinToLocal({
host: "127.0.0.1",
port: 5001,
protocol: "http",
});
const contentToPin = "Hello world"; // A string, or a file buffer (Uint8Array)
// This will generate:
// INSERT INTO my_table_31337_2 (val) values ('bafybeiabfiu2uipule2sro2maoufk2waokktnsbqp5gvaaod3y44ouft54');
const sql = await localPinner`INSERT INTO ${skip(
tableName
)} (val) VALUES ('${contentToPin}');`;
const { meta: insert } = await db.prepare(sql).all();
await insert.txn?.wait();
// Process content from CID in table read
const { results } = await db.prepare(`SELECT * FROM ${tableName}`).all(); // Results contain CID
const resultsWithCIDsResolved = await localPinner.resolve(results, ["val"]);
console.log(resultsWithCIDsResolved);
// [
// {
// id: 1,
// val: 'Hello world'
// }
// ]
This is not only useful for large files; recall that there are also type and feature constraints in the Tableland SQL language, where non-deterministic behavior is blocked. The only types supported are TEXT, BLOB, and INT/INTEGER.
For example, if you want to write floating points, you could simply write the data as TEXT or BLOB. Or maybe you want to write something like a large text embedding, which is a multi-dimensional vector of floating points. With JETI, you could pin that content to IPFS, process the values on the client side, and use them in ML/AI use cases!
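For instance, here is a rough sketch that reuses the localPinner and db from the example above to store an embedding as pinned JSON text (the embeddings_31337_3 table and its embedding column are hypothetical):
// Hypothetical table with a single TEXT column named `embedding`
const embeddingsTable = "embeddings_31337_3";
// e.g., the output of an embedding model, serialized to a JSON string
const embedding = [0.12, -0.98, 0.33, 0.5];
const embeddingText = JSON.stringify(embedding);
// Pin the serialized vector to IPFS and insert its CID into the table
const insertSql = await localPinner`INSERT INTO ${skip(
  embeddingsTable
)} (embedding) VALUES ('${embeddingText}');`;
const { meta } = await db.prepare(insertSql).all();
await meta.txn?.wait();
// Read the rows back and resolve the pinned CIDs to the original JSON
const { results: rows } = await db
  .prepare(`SELECT * FROM ${embeddingsTable}`)
  .all();
const resolved = await localPinner.resolve(rows, ["embedding"]);
const vectors = resolved.map((row) => JSON.parse(row.embedding));
console.log(vectors); // [[0.12, -0.98, 0.33, 0.5]]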
Other updates this week
We’re kicking off internal planning sessions for 2024, so keep an eye out for external updates from that, including new roadmap updates!
End transmission…
Want to dive deeper, ask questions, or just nerd out with us? Jump into our Telegram or Discord—including weekly research office hours or developer office hours. And if you’d like to discuss any of these topics in more detail, comment on the issue over in GitHub!