Apache Arrow is ranked #35 in the Data Preparation Software product directory based on the latest available data collected by SelectHub.

Apache Arrow Benefits and Insights

Why use Apache Arrow?

Key differentiators & advantages of Apache Arrow

  • Fast Data Processing: Maximizes efficiency when computing over large data sets through its columnar memory layout, which allows a single operation to execute across an entire data set in parallel. 
  • Seamless Data Transfer: Build rich services to exchange data according to application-defined semantics through client-server RPC frameworks. Move data between systems at minimal cost — no need to create custom connectors for every single system. Reuse algorithm libraries across multiple languages by virtue of its standardized columnar format. 
  • Ubiquitous Codebase: Provides a standard for representing data frames in memory and allows multiple languages to refer to the same in-memory data. Develop a native columnar query execution engine through a ubiquitous C++ codebase that can be used from Python, R and Ruby. 
  • Integrations: Improves performance and supports innovation through its integrations with SQL engines, Pandas and R, saving the time and effort of reinventing common components. 

Industry Expertise

The platform enables analytics application development for software companies across the world.

Apache Arrow Reviews

Average customer reviews & user sentiment summary for Apache Arrow:

User satisfaction: Great

Based on 14 reviews, 86% of users would recommend this product.

Key Features

  • Libraries: Work with data through its native libraries for C++, C#, Go, Java, JavaScript, MATLAB, Python, R and Ruby — no need to implement a columnar format for each project separately. 
  • Faster CSV Reading: Reads CSV into Pandas more than 10 times faster by virtue of its columnar storage design. Stores and reads data in parallel through record batches — 2D data structures containing columns of data of equal length. 
  • Faster UDFs in PySpark: Efficiently transfers data between Java Virtual Machines (JVMs) and Python processes with vectorized user-defined functions (UDFs), doing away with serialization/deserialization and enabling faster data processing. 
  • Record Batches: Reads a folder containing many data files, and even subfolders, into a single dataframe by virtue of record batches. Some libraries like C++, Python and R support reading entire directories of files and treating them as a single dataset. 
  • Read/Write Parquet Files: Reads Parquet files into Python and R through translators that convert data into language-specific in-memory formats. Writes data held in memory in tools like Pandas and R to disk in Parquet format. 
  • Memory Mapping: Works with data bigger than allocated memory space and allows data sharing across languages and processes through local mapping of its IPC files. 

Limitations

At the time of this review, users report the following limitations:

  • Its memory layout isn’t ideal for workloads that access multiple attributes of a single entity, as in OLTP workloads. 
  • Its code can sometimes be slow to execute. 

Suite Support

The vendor does not offer traditional support for its products, relying instead on documentation and directing developers to the open-source community for answers.

Email: Not available.
Phone: Not available.
Training: Besides vendor-provided documentation, most training is accomplished by asking questions tagged for Apache Arrow on Stack Overflow.
Tickets: Not available.
