Sunday, 11 June 2023

CSV Usage

Maven
We do it like this
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.8</version>
</dependency>
Gradle
We do it like this
implementation "org.apache.commons:commons-csv:1.8"
CSVPrinter is used to write to a file or to memory.

HttpComponents HttpClient Usage

Introduction
1. A CloseableHttpClient object is created using HttpClients. CloseableHttpClient is an abstract class that implements the HttpClient interface.
2. It sends request objects such as HttpGet and HttpPost.

Usage
The explanation is as follows
- Firstly, Create an instance of HttpClient class. (The HttpClient uses a HttpUriRequest to send and receive data.)
- Create an instance of HttpRequestBase class (e.g. HttpGet, HttpPost, HttpTrace, HttpDelete, etc.) and set the necessary headers and parameters.
- Execute the request using HttpClient’s execute() method, and it returns an instance of HttpResponse class.
- Finally, Extract the response content using HttpResponse’s getEntity() method and process it as necessary.
Maven
We include the following dependency
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.5.10</version>
</dependency>

HttpComponents HttpPost Class

Introduction
We include the following line.
import org.apache.http.client.methods.HttpPost;
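setEntity method
Example - MultipartEntity
We do it like this. Here both the client and the response are closed by try-with-resources.
// MultipartEntityBuilder comes from the separate httpmime artifact.
// 'array' is assumed to be a byte[] holding the file content.
try (CloseableHttpClient client = HttpClients.createDefault()) {
  HttpPost httpPost = new HttpPost("http://localhost:8080");

  MultipartEntityBuilder builder = MultipartEntityBuilder.create();
  builder.addTextBody("fileName", "Newwwww");
  builder.addTextBody("fileType", "png");
  builder.addBinaryBody("file", array);

  HttpEntity multipart = builder.build();
  httpPost.setEntity(multipart);

  try (CloseableHttpResponse response = client.execute(httpPost)) {
    // Consume the entity so the underlying connection can be released
    EntityUtils.consume(response.getEntity());
  }
}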

Example - JSON
We do it like this
// 1. Create HTTP Method
HttpPost httpPost = new HttpPost(url);
// 2. Set payload and content-type
httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));

Monday, 5 June 2023

Parquet

Introduction
The explanation is as follows
Apache Parquet (jointly developed by Twitter and Cloudera), widely used in Hadoop ecosystems like Pig, Spark, and Hive, is a favored file format for column storage. The format, which employs a binary representation, is language-agnostic. With its .parquet extension, Parquet is designed for the efficient storage of substantial data sets.
Avro vs Parquet
The explanation is as follows
Avro and Parquet are both compact binary storage formats that require a schema to structure the data that is being encoded. The difference is that Avro stores data in row format and Parquet stores data in a columnar format.
In my experience, these two formats are pretty much interchangeable. In fact, Parquet natively supports Avro schemas i.e., you could send Avro data to a Parquet reader and it would work just fine.
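The row versus column difference can be sketched with plain Java collections. This is only a conceptual illustration, not the Avro or Parquet API; the Employee and EmployeeColumns classes and their fields are hypothetical names made up for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: plain Java collections, not the Avro or Parquet APIs.
class RowVsColumnSketch {

    // Row layout (Avro-like): all fields of one record sit together.
    static final class Employee {
        final int id;
        final String name;
        final double salary;

        Employee(int id, String name, double salary) {
            this.id = id;
            this.name = name;
            this.salary = salary;
        }
    }

    // Column layout (Parquet-like): all values of one field sit together.
    static final class EmployeeColumns {
        final List<Integer> ids = new ArrayList<>();
        final List<String> names = new ArrayList<>();
        final List<Double> salaries = new ArrayList<>();
    }

    // Pivots a list of rows into one list per field.
    static EmployeeColumns toColumns(List<Employee> rows) {
        EmployeeColumns columns = new EmployeeColumns();
        for (Employee e : rows) {
            columns.ids.add(e.id);
            columns.names.add(e.name);
            columns.salaries.add(e.salary);
        }
        return columns;
    }

    // Row layout: every whole record is visited to reach one field.
    static double totalSalaryFromRows(List<Employee> rows) {
        double total = 0.0;
        for (Employee e : rows) {
            total += e.salary;
        }
        return total;
    }

    // Column layout: only the salary column is visited.
    static double totalSalaryFromColumns(EmployeeColumns columns) {
        double total = 0.0;
        for (double salary : columns.salaries) {
            total += salary;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Employee> rows = Arrays.asList(
                new Employee(1, "Ada", 100.0),
                new Employee(2, "Linus", 50.5));
        EmployeeColumns columns = toColumns(rows);
        System.out.println(totalSalaryFromRows(rows));
        System.out.println(totalSalaryFromColumns(columns));
    }
}
```

Both methods return the same total. The point is that an aggregate over one field only needs one list in the column layout, which is why columnar formats suit analytical queries.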
Example
We do it like this
employee_id (int32)
name (string)
salary (double)
hire_date (timestamp)
Schema Evolution
The explanation is as follows. In other words, the only thing that cannot be changed is a field's type.
(1) Adding new fields: Suppose we have a Parquet file containing data with the following schema: If we want to add a new field called "gender" to the schema, we can do so without having to rewrite the entire file.

(2) Modifying field types: Suppose we have a Parquet file containing data with the following schema: If we want to change the data type of the "age" field from int to long, we cannot do so without breaking schema compatibility. Because the field type has been changed, Parquet cannot read and write data to the file using the new schema without rewriting the entire file.

(3) Deleting fields: Suppose we have a Parquet file containing data with the following schema: If we want to delete the "gender" field from the schema, we can do so without having to rewrite the entire file. Parquet can read and write data to the file using the new schema without having to rewrite the entire file.
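The three rules quoted above can be modeled in a few lines of Java. This is a rough sketch, not the Parquet API: schemas are represented as a hypothetical map from field name to type name, and the canReadWithoutRewrite and project helpers are invented for illustration. It follows the rules exactly as the quote states them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch of the three rules quoted above; this is not the Parquet API.
class SchemaEvolutionSketch {

    // Schemas are modeled as field name -> type name (hypothetical representation).
    static boolean canReadWithoutRewrite(Map<String, String> writerSchema,
                                         Map<String, String> readerSchema) {
        for (Map.Entry<String, String> field : readerSchema.entrySet()) {
            String writtenType = writerSchema.get(field.getKey());
            if (writtenType == null) {
                continue; // (1) added field: old files simply hold no values for it
            }
            if (!writtenType.equals(field.getValue())) {
                return false; // (2) changed type: treated as incompatible here
            }
        }
        // (3) deleted fields: columns absent from the reader schema are never requested
        return true;
    }

    // Projects one stored row onto the reader schema; added fields come back as null.
    static Map<String, Object> project(Map<String, Object> storedRow,
                                       Map<String, String> readerSchema) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String name : readerSchema.keySet()) {
            result.put(name, storedRow.get(name));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> written = new LinkedHashMap<>();
        written.put("name", "string");
        written.put("age", "int");

        Map<String, String> withGender = new LinkedHashMap<>(written);
        withGender.put("gender", "string");

        Map<String, String> ageAsLong = new LinkedHashMap<>(written);
        ageAsLong.put("age", "long");

        System.out.println(canReadWithoutRewrite(written, withGender));
        System.out.println(canReadWithoutRewrite(written, ageAsLong));
    }
}
```

Adding "gender" and dropping a field both pass the check, while changing "age" from int to long fails it, matching the quoted explanation.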

Predicate Pushdown
The explanation is as follows
Predicate pushdown is a technique used in Parquet and other columnar storage formats to improve query performance by filtering data before it is read from disk. When a query is executed on a Parquet file, the query engine can push down filters to the storage layer, which allows for faster query performance by reducing the amount of data that needs to be read from the disk.

The basic idea behind predicate pushdown is to push the filtering operation as close to the data as possible. Instead of reading the entire dataset from disk and then filtering it in memory, the query engine pushes the filter operation down to the storage layer, which applies the filter during the data read operation. This can significantly reduce the amount of data that needs to be read from the disk, which in turn reduces query execution time.
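The idea can be sketched in plain Java with in-memory "row groups" that carry min/max statistics, similar in spirit to the statistics Parquet keeps in its file metadata. This is not the Parquet API; RowGroup, ScanResult and the salary values are hypothetical and exist only for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: in-memory "row groups" with min/max statistics, not the Parquet API.
class PredicatePushdownSketch {

    static final class RowGroup {
        final double[] salaries;
        final double min; // would serve predicates such as "salary < x"
        final double max; // serves the "salary > x" predicate used below

        // Assumes at least one value per row group.
        RowGroup(double... salaries) {
            this.salaries = salaries;
            double lo = salaries[0];
            double hi = salaries[0];
            for (double s : salaries) {
                lo = Math.min(lo, s);
                hi = Math.max(hi, s);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    static final class ScanResult {
        final List<Double> matches = new ArrayList<>();
        int rowsRead = 0;
    }

    // Reads every row, then filters in memory.
    static ScanResult scanWithoutPushdown(List<RowGroup> file, double threshold) {
        ScanResult result = new ScanResult();
        for (RowGroup group : file) {
            readMatching(group, threshold, result);
        }
        return result;
    }

    // Checks the statistics first and skips row groups that cannot match "salary > threshold".
    static ScanResult scanWithPushdown(List<RowGroup> file, double threshold) {
        ScanResult result = new ScanResult();
        for (RowGroup group : file) {
            if (group.max <= threshold) {
                continue; // no value in this group can exceed the threshold
            }
            readMatching(group, threshold, result);
        }
        return result;
    }

    private static void readMatching(RowGroup group, double threshold, ScanResult result) {
        for (double salary : group.salaries) {
            result.rowsRead++;
            if (salary > threshold) {
                result.matches.add(salary);
            }
        }
    }

    public static void main(String[] args) {
        List<RowGroup> file = Arrays.asList(
                new RowGroup(1000, 2000, 1500),
                new RowGroup(3000, 4000),
                new RowGroup(5000, 6000, 5500));
        System.out.println(scanWithoutPushdown(file, 4500).rowsRead); // 8
        System.out.println(scanWithPushdown(file, 4500).rowsRead);    // 3
    }
}
```

Both scans return the same matches, but the pushdown variant touches only the one row group whose maximum exceeds the threshold.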