Meet PeriFlow: The Generative AI Powerhouse

In the rapidly evolving world of generative AI, there's a constant search for tools that deliver high performance without breaking the bank. Enter PeriFlow, a solution that promises to be the fastest generative AI serving engine available, pairing raw performance with broad versatility.

How PeriFlow Propels Performance

PeriFlow is the brainchild of seasoned experts, drawing upon profound research and vast experience in operating generative AI models. It leverages multi-layer optimizations, along with scheduling and batching techniques, ensuring a seamless experience. The technology underpinning its batching capabilities is even patented in the US and Korea, highlighting its innovation.

Supporting an Array of Models

The strength of PeriFlow lies in its wide-ranging support for generative AI models, such as:

  • GPT, GPT-J, and GPT-NeoX
  • MPT and LLaMA
  • Dolly, OPT, and BLOOM
  • CodeGen, T5, FLAN, and UL2

These models have become fundamental to applications including chatbots, translation services, content summarization, code generation, and even caption creation. Serving them has historically been expensive and complex, but PeriFlow aims to make that a thing of the past.

Decoding and Data Types

To cater to diverse requirements, PeriFlow supports decoding options like greedy, top-k, top-p, beam search, and stochastic beam search. Moreover, it's compatible with data types including fp32, fp16, bf16, and int8. This range of options allows for flexibility depending on the specific needs.
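As a rough sketch of how such decoding options might appear in a completions request body (the parameter names here are illustrative assumptions, not PeriFlow's documented API):

```python
import json

def build_payload(prompt, *, max_tokens=16, top_k=None, top_p=None,
                  beam_size=None, temperature=1.0):
    """Assemble a JSON request body, including only the decoding
    options that are actually set. Field names are hypothetical."""
    payload = {"prompt": prompt, "max_tokens": max_tokens,
               "temperature": temperature}
    if top_k is not None:
        payload["top_k"] = top_k          # sample from the k most likely tokens
    if top_p is not None:
        payload["top_p"] = top_p          # nucleus (top-p) sampling threshold
    if beam_size is not None:
        payload["beam_size"] = beam_size  # beam search width
    return json.dumps(payload)

body = build_payload("Say this is a test", max_tokens=5, top_p=0.9)
```

Only one sampling strategy would normally be set per request; the helper simply omits anything left unset.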

Speed and Cost Efficiency

One of PeriFlow's most remarkable features is its performance compared to competitors. It notably outpaces NVIDIA Triton+FasterTransformer in both latency and throughput across a spectrum of LLM sizes. For instance, it offers a tenfold throughput enhancement for a GPT-3 175B model without compromising on latency.

Accessing PeriFlow

PeriFlow presents two convenient usage approaches:

PeriFlow Container

For those who prefer managing their environment, the PeriFlow Container can be operated on-premise.

PeriFlow Cloud

Alternatively, users may opt for the PeriFlow Cloud service for an auto-managed solution.

Utilizing PeriFlow

Deploying and using a generative AI model with PeriFlow is a straightforward two-step process:

  1. Deploy your model within PeriFlow
  2. Send inference requests from your downstream applications to the deployed model

For instance, you can send an inference request to an HTTP endpoint provided after deployment:

curl http://<periflow-endpoint>/v1/completions \
-H "Content-Type: application/json" \
-d '{ "prompt": "Say this is a test", "max_tokens": 5}'

Responses are returned in a clear and concise format:

  {
      "choices": [
          {
              "index": 0,
              "text": ", say it works!",
              "tokens": [11, 910, 340, 2499, 0]
          }
      ]
  }
Company Credentials

FriendliAI, the innovative minds behind PeriFlow, is headquartered in Redwood City, California, with an additional hub in Seoul, Korea. They maintain a strong commitment to their users, evident through their comprehensive privacy policy and transparent service agreements.

Interested in more details or trying out PeriFlow? The knowledgeable team at FriendliAI is available to answer questions.

Pros and Cons

PeriFlow is not without its pros and cons:

Pros:

  • Exceptional speed and throughput performance
  • Broad compatibility with various AI models and decoding options
  • Ability to run on-premise or via cloud for added flexibility
  • Patented batching technology

Cons:

  • Some initial understanding of deployment and inference requests is necessary
  • Potentially limited by the specific AI models and workloads it supports

In conclusion, PeriFlow stands out as a significant advancement in serving generative AI models efficiently. With its focus on speed, versatility, and cost-effective operation, it's positioned to be an invaluable asset to various AI-driven applications.
