DeepSparse Server
DeepSparse is an inference runtime that takes advantage of sparsity in neural networks to deliver GPU-class performance on CPUs.
Overview
DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. The server lets you set up a model-serving endpoint running DeepSparse, so you can send raw data over HTTP and receive post-processed predictions.
Configuration
The DeepSparse server supports any DeepSparse Pipeline task, including NLP, image classification, and object detection. An up-to-date list of available tasks can be found in the DeepSparse Pipelines Introduction.
The default configuration of this app initializes the DeepSparse server with the zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none BERT model, launched with the following command:
deepsparse.server --task sentiment_analysis --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
You can customize the configuration of the DeepSparse server by adjusting the Docker args in the Koyeb Service settings page. For example, to perform object detection using a YOLOv8 model, change the model_path to zoo:cv/detection/yolov8-s/pytorch/ultralytics/coco/pruned50_quant-none and the task to yolov8.
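With those values swapped into the default launch command shown above, the resulting command would look like this (a sketch; verify the available tasks and flags against the DeepSparse CLI help for your installed version):
deepsparse.server --task yolov8 --model_path zoo:cv/detection/yolov8-s/pytorch/ultralytics/coco/pruned50_quant-none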
Try it out
Once the DeepSparse server is deployed, you can start sending requests to the /v2/models/sentiment_analysis/infer
endpoint to get predictions. For example, to get BERT's prediction of the sentiment of a Tweet, you can send the following request:
$ curl https://<YOUR_APP_NAME>-<YOUR_KOYEB_ORG>.koyeb.app/v2/models/sentiment_analysis/infer -X POST \
-H "Content-Type: application/json" \
-d '{"sequences": "Just deployed my @neuralmagic DeepSparse Server on @gokoyeb and I must say! Match made in heaven 😍"}'
{"labels":["negative"],"scores":[0.8791840076446533]}%