
DeepSeek AI Releases Smallpond: A Lightweight Data Processing Framework Built on DuckDB and 3FS

Modern data workflows are increasingly strained by growing datasets and the complexity of distributed processing. Many organizations find that traditional systems struggle with long processing times, memory constraints, and the effective management of distributed tasks. In this environment, data scientists and engineers often spend excessive time on system maintenance instead of extracting insights from data. There is a clear need for a tool that simplifies these processes without sacrificing performance.

DeepSeek AI recently released Smallpond, a lightweight data processing framework built on DuckDB and 3FS. Smallpond aims to extend DuckDB's efficient SQL analytics into a distributed setting. By coupling DuckDB with 3FS, a high-performance distributed file system optimized for modern SSDs and RDMA networks, Smallpond offers a practical solution for processing large datasets without the complexity of long-running services or heavy infrastructure overhead.

Technical details and advantages

Smallpond is designed to work seamlessly with Python and supports versions 3.8 through 3.12. Its design philosophy is grounded in simplicity and modularity. Users can quickly install the framework via pip and begin processing data with minimal setup. A key feature is the ability to partition data manually. Whether partitioning by file count, by row count, or by the hash of a specific column, this flexibility lets users tailor processing to their particular data and infrastructure.
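To make the three partitioning strategies concrete, here is a minimal pure-Python sketch. It is a conceptual illustration only and not Smallpond's implementation: the function names `partition_by_files`, `partition_by_rows`, and `partition_by_hash`, the sample file names, and the sample rows are all hypothetical.

```python
from zlib import crc32

def partition_by_files(files, n):
    """Assign whole files to n partitions in round-robin order."""
    parts = [[] for _ in range(n)]
    for i, name in enumerate(files):
        parts[i % n].append(name)
    return parts

def partition_by_rows(rows, n):
    """Split rows into n contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts

def partition_by_hash(rows, n, key):
    """Route each row by a hash of row[key], so equal keys share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's salted built-in hash()
        idx = crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    return parts

# Hypothetical sample data
files = ["a.parquet", "b.parquet", "c.parquet", "d.parquet"]
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.0},
    {"ticker": "AAA", "price": 12.5},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 6.5},
]

file_parts = partition_by_files(files, 3)
row_parts = partition_by_rows(rows, 3)
hash_parts = partition_by_hash(rows, 3, "ticker")
```

Hash partitioning is the strategy that matters for grouped queries: because all rows for a given ticker land in the same partition, a GROUP BY on that column can be computed independently per partition.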

Under the hood, Smallpond relies on DuckDB for robust, native-level performance when executing SQL queries. The framework also integrates with Ray to enable parallel processing across distributed compute nodes. This combination simplifies scaling and ensures that workloads are handled efficiently across multiple nodes. By avoiding persistent services, Smallpond reduces the operational overhead typically associated with distributed systems.
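The pattern described here, running one SQL query independently on each partition and in parallel, can be sketched with the Python standard library alone. This is an assumption-laden stand-in, not how Smallpond is built: `sqlite3` substitutes for DuckDB, a local thread pool substitutes for Ray workers, and the `prices` table, `run_partial_sql` helper, and sample rows are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

SQL = "SELECT ticker, min(price), max(price) FROM prices GROUP BY ticker"

def run_partial_sql(partition):
    """Run the same query on one partition in its own in-memory database."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE prices (ticker TEXT, price REAL)")
        con.executemany("INSERT INTO prices VALUES (?, ?)", partition)
        return con.execute(SQL).fetchall()
    finally:
        con.close()

# Hypothetical partitions, already hash-split by ticker:
# no ticker appears in more than one partition.
partitions = [
    [("AAA", 10.0), ("AAA", 12.5)],
    [("BBB", 7.0), ("BBB", 6.5), ("CCC", 3.0)],
    [("DDD", 20.0)],
]

# Each worker handles one partition; no long-running service is involved.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(run_partial_sql, partitions))

# Because keys never span partitions, concatenating the partial results
# already gives the global answer.
result = sorted(row for part in partial_results for row in part)
print(result)
```

If the data were not partitioned by the grouping column, the per-partition results would need a second aggregation step to merge rows that share a ticker.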

Installation

Python 3.8 to 3.12 is supported.

pip install smallpond

Quick start

# Download example data
wget 

import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())

Performance and insights

In performance tests using the GraySort benchmark, Smallpond demonstrated its capacity by sorting 110.5 TiB in just over 30 minutes, reaching an average throughput of 3.66 TiB per minute. These results illustrate how effectively the framework uses the combined strengths of DuckDB for computation and 3FS for storage. Such performance figures provide reassurance that Smallpond can meet the needs of organizations dealing with terabytes to petabytes of data. The open-source nature of the project also means that users and developers can collaborate on further optimizations and adapt the framework to a variety of use cases.
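As a quick sanity check, the reported figures are consistent with each other. The short calculation below uses only the two numbers quoted above (110.5 TiB and 3.66 TiB per minute); the conversion to GiB per second is derived here and is not a figure from the benchmark report.

```python
total_tib = 110.5          # data sorted, in TiB (from the article)
tib_per_minute = 3.66      # average throughput (from the article)

# Implied duration of the sort
minutes = total_tib / tib_per_minute

# Same throughput expressed in GiB/s (1 TiB = 1024 GiB, 1 min = 60 s)
gib_per_second = tib_per_minute * 1024 / 60

print(f"{minutes:.1f} min, {gib_per_second:.1f} GiB/s")  # 30.2 min, 62.5 GiB/s
```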

Conclusion

Smallpond represents a measured but significant step forward in distributed data processing. It addresses the central challenges by extending the proven efficiency of DuckDB into a distributed environment, supported by the high-throughput capabilities of 3FS. With a focus on simplicity, flexibility, and performance, Smallpond offers a practical tool for data scientists and engineers tasked with processing large datasets. As an open-source project, it invites contributions and continuous improvement from the community. Whether managing modest datasets or scaling up to petabyte-level operations, Smallpond offers a robust framework that is both effective and accessible.


Check out the GitHub Repo. All credit for this research goes to the researchers of this project.



Asif Razzaq is the CEO of Marktechpost Media Inc. His latest endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easy for a wide audience to understand. The platform has 2 million monthly views, which illustrates its popularity with readers.
