Wednesday, 30 October 2024

Apache Arrow - Columnar Memory Format

Introduction
The explanation is as follows
Apache Arrow is a cross-language development framework for in-memory data. It provides a standardized columnar memory format for efficient data sharing and fast analytics. Arrow employs a language-agnostic approach, designed to eliminate the need for data serialization and deserialization, improving the performance and interoperability between complex data processes and systems.
Apache Arrow vs Apache Parquet 
The explanation is as follows
The Apache Arrow format project began in February 2016, focusing on columnar in-memory analytics workload. Unlike file formats like Parquet or CSV, which specify how data is organized on disk, Arrow focuses on how data is organized in memory.
Maven
We include the following dependencies
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory-netty</artifactId>
  <version>6.0.1</version>
</dependency>

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>6.0.1</version>
</dependency>
Example
For writing, we do the following
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.*;
import org.apache.arrow.vector.ipc.*;

// Set up the allocator and the vector
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  VarCharVector vector = new VarCharVector("vector", allocator)) {

  // Write data to the vector
  vector.allocateNew();
  vector.setSafe(0, "Apache".getBytes(StandardCharsets.UTF_8));
  vector.setSafe(1, "Arrow".getBytes(StandardCharsets.UTF_8));
  vector.setSafe(2, "Java".getBytes(StandardCharsets.UTF_8));
  vector.setValueCount(3);

  // Wrap the vector in a VectorSchemaRoot and write it in the Arrow IPC file format.
  // ArrowWriter itself is abstract, so the concrete ArrowFileWriter is used;
  // null means there is no dictionary provider.
  try (VectorSchemaRoot root = VectorSchemaRoot.of(vector);
    FileOutputStream out = new FileOutputStream("arrow-data.arrow");
    ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {
    writer.start();
    writer.writeBatch();
    writer.end();
  }
}
For reading, we do the following
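import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.ipc.ArrowFileReader;

// Now, let's read the data we just wrote.
// ArrowReader itself is abstract, so the concrete ArrowFileReader is used.
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  FileInputStream in = new FileInputStream("arrow-data.arrow");
  ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {

  // Read the schema and load each record batch in turn
  while (reader.loadNextBatch()) {
    // Get the vector; the reader owns it, so it is not closed here
    VarCharVector vector = (VarCharVector) reader.getVectorSchemaRoot().getVector("vector");
    // Iterate over the values in the vector
    for (int i = 0; i < vector.getValueCount(); i++) {
      System.out.println(new String(vector.get(i), StandardCharsets.UTF_8));
    }
  }
}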

Tuesday, 18 July 2023

Apache CarbonData

Introduction
The explanation is as follows
Apache CarbonData is an indexed columnar data format that is developed specifically for big data scenarios where fast analytics and real-time insights are critical.
Deep Integration with Spark
The explanation is as follows
CarbonData has been deeply integrated with Apache Spark, providing Spark SQL’s query optimization techniques and using its Code Generation capabilities. This makes it possible to directly query CarbonData files using Spark SQL, hence giving faster and more efficient query results.
Multi-Layered Structure
The explanation is as follows
Apache CarbonData is structured in multiple layers, which includes the table, segment, block, and page levels. This hierarchical structure allows efficient data retrieval by skipping irrelevant data during the query execution.

Table: A table is a collection of segments, and each segment represents a set of data files.

Segment: A segment contains multiple data blocks, where each block can store a significant amount of data.

Block: A block is divided into blocklets. Each blocklet holds a series of column pages, which are organized column-wise.

Page: The page level is where the actual data is stored. The data in these pages is encoded and compressed, making data retrieval efficient.

Avro Compatibility

Introduction
The details of the explanations are here

1. Backward Compatibility
This concerns the side that is ahead, that is, the newer version. Backward compatibility means that the latest version can read data produced with the previous version.

BACKWARD
The last two schemas are backward compatible.

BACKWARD_TRANSITIVE
All schemas are backward compatible.

2. Forward Compatibility
This concerns the side that lags behind, that is, the older version. Forward compatibility means that an older schema can read data produced with a newer schema.

FORWARD
The last two schemas are forward compatible. That is, data produced with the latest schema can be read with the schema just before it.

FORWARD_TRANSITIVE
All schemas are forward compatible.

3. Full Compatibility
FULL
The last two schemas can read each other's data.

FULL_TRANSITIVE
All schemas can read each other's data.
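
The backward, forward, and full directions above can be sketched with a deliberately simplified model. This is not Avro's own resolution logic: here a schema is only a map from field name to "has a default value", and the class and method names (CompatibilitySketch, canRead, and so on) are hypothetical. The one rule modeled is that a reader can fill a field missing from the data only if that field declares a default.

```java
import java.util.Map;

public class CompatibilitySketch {
    // Simplified model (an assumption): schema = field name -> "declares a default value".
    // Real Avro schema resolution also checks types, aliases, unions and more.
    static boolean canRead(Map<String, Boolean> reader, Map<String, Boolean> writer) {
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean presentInData = writer.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (!presentInData && !hasDefault) {
                return false; // the reader needs a value that the data cannot supply
            }
        }
        return true; // fields known only to the writer are simply skipped
    }

    // BACKWARD: the new schema reads data written with the old schema.
    static boolean isBackward(Map<String, Boolean> newSchema, Map<String, Boolean> oldSchema) {
        return canRead(newSchema, oldSchema);
    }

    // FORWARD: the old schema reads data written with the new schema.
    static boolean isForward(Map<String, Boolean> newSchema, Map<String, Boolean> oldSchema) {
        return canRead(oldSchema, newSchema);
    }

    // FULL: both directions hold.
    static boolean isFull(Map<String, Boolean> newSchema, Map<String, Boolean> oldSchema) {
        return isBackward(newSchema, oldSchema) && isForward(newSchema, oldSchema);
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = Map.of("id", false, "name", false);
        Map<String, Boolean> v2 = Map.of("id", false, "name", false, "email", true);  // email has a default
        Map<String, Boolean> v3 = Map.of("id", false, "name", false, "phone", false); // phone has no default

        System.out.println(isFull(v2, v1));     // true
        System.out.println(isBackward(v3, v1)); // false
        System.out.println(isForward(v3, v1));  // true
    }
}
```

The transitive variants would apply the same check between the new schema and every earlier version rather than only the previous one.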

Sunday, 11 June 2023

Using CSV

Maven
We do the following
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.8</version>
</dependency>
Gradle
We do the following
implementation "org.apache.commons:commons-csv:1.8"
CSVPrinter is used to write to a file or to memory.

Using HttpComponents HttpClient

Introduction
1. A CloseableHttpClient object is created. The HttpClients class is used to create it. CloseableHttpClient is an abstract class that implements the HttpClient interface.
2. It sends request objects such as HttpGet and HttpPost.

Usage
The explanation is as follows
- Firstly, Create an instance of HttpClient class. (The HttpClient uses a HttpUriRequest to send and receive data.)
- Create an instance of HttpRequestBase class (e.g. HttpGet, HttpPost, HttpTrace, HttpDelete, etc.) and set the necessary headers and parameters.
- Execute the request using HttpClient’s execute() method, and it returns an instance of HttpResponse class.
- Finally, Extract the response content using HttpResponse’s getEntity() method and process it as necessary.
Maven
We include the following dependency
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.5.10</version>
</dependency>

HttpComponents HttpPost Class

Introduction
We include the following import.
import org.apache.http.client.methods.HttpPost;
setEntity method
Example - MultipartEntity
We do the following. Here both the response and the client are closed in the code.
// array is the byte[] that holds the file content.
// MultipartEntityBuilder comes from the separate httpmime artifact.
HttpPost httpPost = new HttpPost("http://localhost:8080");

MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addTextBody("fileName", "Newwwww");
builder.addTextBody("fileType", "png");
builder.addBinaryBody("file", array);

HttpEntity multipart = builder.build();
httpPost.setEntity(multipart);

// try-with-resources closes the response first, then the client
try (CloseableHttpClient client = HttpClients.createDefault();
  CloseableHttpResponse response = client.execute(httpPost)) {
  // Process the response here
}

Example - JSON
We do the following
// 1. Create HTTP Method
HttpPost httpPost = new HttpPost(url);
// 2. Set payload and content-type
httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));

Monday, 5 June 2023

Apache Parquet - Used for Data Warehousing

Introduction
In short
- Designed for efficient storage and processing of large datasets (analytics, data warehousing) especially in big data frameworks like Apache Spark and Apache Hive.
- Uses advanced compression techniques such as dictionary encoding and run-length encoding to compress data efficiently. This reduces storage space requirements and speeds up data retrieval.
- Optimized for read-heavy workloads
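
The two encodings named in the list can be illustrated with a minimal sketch. It only shows the idea on a small in-memory column; it is not Parquet's on-disk encoding, and the class and method names (EncodingSketch, dictionaryEncode, runLengthEncode) are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EncodingSketch {
    // Dictionary encoding: replace each distinct value with a small integer id.
    // The dictionary map is filled as new values are seen.
    static int[] dictionaryEncode(List<String> column, Map<String, Integer> dictionary) {
        int[] ids = new int[column.size()];
        for (int i = 0; i < column.size(); i++) {
            String value = column.get(i);
            Integer id = dictionary.get(value);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(value, id);
            }
            ids[i] = id;
        }
        return ids;
    }

    // Run-length encoding: collapse consecutive repeats into {value, runLength} pairs.
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int value : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == value) {
                runs.get(runs.size() - 1)[1]++;
            } else {
                runs.add(new int[] {value, 1});
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<String> column = List.of("TR", "TR", "TR", "DE", "DE", "TR");
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        int[] ids = dictionaryEncode(column, dictionary);
        System.out.println(dictionary);           // {TR=0, DE=1}
        System.out.println(Arrays.toString(ids)); // [0, 0, 0, 1, 1, 0]
        for (int[] run : runLengthEncode(ids)) {
            System.out.println(run[0] + " x " + run[1]); // 0 x 3, then 1 x 2, then 0 x 1
        }
    }
}
```

Both steps pay off most on columns with few distinct values and long runs, which is common when the values of one column are stored together.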
History
The explanation is as follows
Apache Parquet (jointly developed by Twitter and Cloudera), widely used in Hadoop ecosystems like Pig, Spark, and Hive, is a favored file format for column storage. The format, which employs a binary representation, is language-agnostic. With its .parquet extension, Parquet is designed for the efficient storage of substantial data sets.
Why the Row Format Is Bad
If each row is stored in a different disk sector, access to the data becomes less efficient. In analytic workloads, moreover, usually only certain columns are needed rather than the whole row. The explanation is as follows
However, it can be inefficient when dealing with analytics, where you often only need specific columns from a large dataset.

For example, imagine a table with 50 columns and millions of rows. If you’re only interested in analyzing 3 of those columns, a row-wise format would still require you to read all 50 columns for each row.
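
The cost difference in this example can be made concrete with a small sketch. It models storage as plain Java arrays and simply counts how many values a scan has to load; the class and method names (LayoutSketch, sumRowWise, sumColumnWise) are hypothetical, and real engines work on disk pages rather than arrays.

```java
public class LayoutSketch {
    // Row layout: one array per row. To reach one column, the scan loads every full row.
    static long sumRowWise(long[][] rows, int column, long[] valuesRead) {
        long sum = 0;
        for (long[] row : rows) {
            valuesRead[0] += row.length; // the whole row is loaded
            sum += row[column];
        }
        return sum;
    }

    // Column layout: one array per column. Only the wanted column is loaded.
    static long sumColumnWise(long[][] columns, int column, long[] valuesRead) {
        long[] data = columns[column];
        valuesRead[0] += data.length;
        long sum = 0;
        for (long value : data) {
            sum += value;
        }
        return sum;
    }

    // Turn a row-wise table into a column-wise one.
    static long[][] transpose(long[][] rows) {
        int columnCount = rows.length == 0 ? 0 : rows[0].length;
        long[][] columns = new long[columnCount][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < columnCount; c++) {
                columns[c][r] = rows[r][c];
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        long[][] rows = new long[1000][50];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < 50; c++) {
                rows[r][c] = r + c;
            }
        }
        long[][] columns = transpose(rows);

        long[] rowReads = {0};
        long[] columnReads = {0};
        System.out.println(sumRowWise(rows, 3, rowReads));          // 502500
        System.out.println(sumColumnWise(columns, 3, columnReads)); // 502500
        System.out.println(rowReads[0] + " vs " + columnReads[0]);  // 50000 vs 1000
    }
}
```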
Column Format
The values of each column are stored together on disk, so only the needed data is read. The problem arises when the data is updated. The explanation is as follows
However, simply storing data in a columnar format has some downsides. The record write or update operation requires touching multiple column segments, resulting in numerous I/O operations. This can significantly slow the write performance, especially when dealing with large datasets.

In addition, when queries involve multiple columns, the database system must reconstruct the records from separate columns. The cost of this reconstruction increases with the number of columns involved in the query.
Hybrid Format
The explanation is as follows
The hybrid format combines the best of both worlds.

The format groups data into “row groups,” each containing a subset of rows. (horizontal partition.) Within each row group, data for each column is called a “column chunk.” (vertical partition)

In the row group, these chunks are guaranteed to be stored contiguously on disk.

In the past, I thought Parquet was purely a columnar format, and I’m sure many of you might think the same. To describe it more precisely, Parquet organizes data in a hybrid format behind the scenes.
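
A minimal sketch of the row group and column chunk idea, assuming a tiny table of int values held in memory. The names (RowGroupSketch, toRowGroups) are hypothetical and the real Parquet layout adds pages, metadata, and encodings on top of this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowGroupSketch {
    // Split a table (given row by row) into row groups of at most rowGroupSize rows,
    // then store each group column by column.
    // Result: groups.get(g)[c] is the column chunk of column c in row group g.
    static List<int[][]> toRowGroups(int[][] rows, int rowGroupSize) {
        List<int[][]> groups = new ArrayList<>();
        int columnCount = rows.length == 0 ? 0 : rows[0].length;
        for (int start = 0; start < rows.length; start += rowGroupSize) {
            int end = Math.min(start + rowGroupSize, rows.length);
            int[][] chunks = new int[columnCount][end - start];
            for (int r = start; r < end; r++) {
                for (int c = 0; c < columnCount; c++) {
                    chunks[c][r - start] = rows[r][c];
                }
            }
            groups.add(chunks);
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] rows = {{1, 10}, {2, 20}, {3, 30}, {4, 40}, {5, 50}};
        List<int[][]> groups = toRowGroups(rows, 2);
        System.out.println(groups.size());                     // 3
        System.out.println(Arrays.deepToString(groups.get(0))); // [[1, 2], [10, 20]]
        System.out.println(Arrays.deepToString(groups.get(2))); // [[5], [50]]
    }
}
```

The horizontal split keeps the values of one record close to each other, while the vertical split inside each group keeps the column-wise read pattern.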
Avro vs Parquet
The explanation is as follows
Avro and Parquet are both compact binary storage formats that require a schema to structure the data that is being encoded. The difference is that Avro stores data in row format and Parquet stores data in a columnar format.
In my experience, these two formats are pretty much interchangeable. In fact, Parquet natively supports Avro schemas i.e., you could send Avro data to a Parquet reader and it would work just fine.
Example
A schema looks like the following
employee_id (int32)
name (string)
salary (double)
hire_date (timestamp)
Schema Evolution
The explanation is as follows. In short, the only change that is not possible is changing a field type.
(1) Adding new fields: Suppose we have a Parquet file containing data with the following schema: If we want to add a new field called "gender" to the schema, we can do so without having to rewrite the entire file.

(2) Modifying field types: Suppose we have a Parquet file containing data with the following schema: If we want to change the data type of the "age" field from int to long, we cannot do so without breaking schema compatibility. Because the field type has been changed, Parquet cannot read and write data to the file using the new schema without rewriting the entire file.

(3) Deleting fields: Suppose we have a Parquet file containing data with the following schema: If we want to delete the "gender" field from the schema, we can do so without having to rewrite the entire file. Parquet can read and write data to the file using the new schema without having to rewrite the entire file.
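
Cases (1) and (3) can be pictured with a simplified model in which a file stores its columns by name and a reader asks for columns by name. This is only a sketch of the idea, not the behavior of any particular Parquet reader: the names (SchemaEvolutionSketch, read) are hypothetical, and filling a missing column with nulls is an assumption about how the reader resolves it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaEvolutionSketch {
    // Read an existing "file" (column name -> values) with a possibly newer schema.
    // Assumption: a column the file does not contain is returned as all nulls,
    // and a column the schema no longer lists is simply not read.
    static Map<String, Object[]> read(Map<String, Object[]> file, List<String> readSchema, int rowCount) {
        Map<String, Object[]> result = new LinkedHashMap<>();
        for (String name : readSchema) {
            Object[] stored = file.get(name);
            result.put(name, stored != null ? stored : new Object[rowCount]);
        }
        return result;
    }

    public static void main(String[] args) {
        // File written with the old schema: name, age
        Map<String, Object[]> file = Map.of(
            "name", new Object[] {"Ann", "Bob"},
            "age", new Object[] {30, 41});

        // New schema: "gender" was added, "age" was deleted. The file is not rewritten.
        Map<String, Object[]> rows = read(file, List.of("name", "gender"), 2);
        System.out.println(Arrays.toString(rows.get("name")));   // [Ann, Bob]
        System.out.println(Arrays.toString(rows.get("gender"))); // [null, null]
        System.out.println(rows.containsKey("age"));             // false
    }
}
```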

Predicate Pushdown
The explanation is as follows
Predicate pushdown is a technique used in Parquet and other columnar storage formats to improve query performance by filtering data before it is read from disk. When a query is executed on a Parquet file, the query engine can push down filters to the storage layer, which allows for faster query performance by reducing the amount of data that needs to be read from the disk.

The basic idea behind predicate pushdown is to push the filtering operation as close to the data as possible. Instead of reading the entire dataset from disk and then filtering it in memory, the query engine pushes the filter operation down to the storage layer, which applies the filter during the data read operation. This can significantly reduce the amount of data that needs to be read from the disk, which in turn reduces query execution time.
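
One common way to apply a filter "during the read" is to keep minimum and maximum statistics per block of data and skip blocks that cannot match. The sketch below assumes that mechanism for a single int column and a "greater than" filter; the names (PushdownSketch, RowGroup, scanGreaterThan) are hypothetical and the model is far simpler than a real storage layer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PushdownSketch {
    // A block of values for one int column, with min/max statistics kept next to it.
    // Assumes at least one value per group.
    static final class RowGroup {
        final int[] values;
        final int min;
        final int max;

        RowGroup(int... values) {
            this.values = values;
            this.min = Arrays.stream(values).min().getAsInt();
            this.max = Arrays.stream(values).max().getAsInt();
        }
    }

    // Evaluate "value > threshold". Groups whose max is not above the threshold
    // are skipped without looking at their values; groupsRead counts the rest.
    static List<Integer> scanGreaterThan(List<RowGroup> groups, int threshold, int[] groupsRead) {
        List<Integer> matches = new ArrayList<>();
        for (RowGroup group : groups) {
            if (group.max <= threshold) {
                continue; // statistics prove that no value in this group can match
            }
            groupsRead[0]++;
            for (int value : group.values) {
                if (value > threshold) {
                    matches.add(value);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = List.of(
            new RowGroup(1, 5, 9),
            new RowGroup(12, 18, 15),
            new RowGroup(20, 25, 30));
        int[] groupsRead = {0};
        System.out.println(scanGreaterThan(groups, 14, groupsRead)); // [18, 15, 20, 25, 30]
        System.out.println(groupsRead[0]);                           // 2
    }
}
```

The first group is never scanned, which is where the saving in data read comes from.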