
Introduction
Following the breakthrough in GenAI with ChatGPT, a plethora of foundational LLMs, both closed and open, have emerged, and Hugging Face now hosts an extensive array of open models. Amid this abundance of options and techniques, it can be hard to determine the right path: whether to use closed LLMs as-is (via APIs), self-host open LLMs, set up quantized models, or fine-tune a model. The ultimate choice depends on the accuracy required for a given use case. This blog post aims to provide a structured framework for a progressive journey towards developing solutions or building internal organizational capabilities, starting from simple approaches and gradually moving toward more complex ones.
We have divided this blog into a two-part series: LLM experiments without fine-tuning and with fine-tuning. This first part concentrates on the progressive journey of LLM experiments without any fine-tuning. Please note that the focus is on approaches and options for using and hosting LLMs, rather than on selecting the right LLM.
Prompting Progression Strategy
Prompting techniques are widely known as an effective way to elicit creative and concise responses from LLMs. Prompt engineering has also evolved, and here, we present a commonly discussed progression strategy from an experimentation perspective to achieve better and more accurate results.
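As a rough illustration of this progression, the sketch below contrasts a zero-shot, a few-shot, and a chain-of-thought phrasing of the same task. The wording and labels are our own and only meant to show how each step adds structure to the prompt.

```python
# Illustrative only: three levels of prompting for the same classification task.
# The exact wording and labels are hypothetical examples, not a prescribed format.

zero_shot = (
    "Classify the sentiment of this review as positive or negative:\n"
    "'The battery died after two days.'"
)

few_shot = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Great screen, fast delivery.' -> positive\n"
    "Review: 'Stopped working within a week.' -> negative\n"
    "Review: 'The battery died after two days.' ->"
)

chain_of_thought = (
    "Classify the sentiment of the review as positive or negative. "
    "First reason step by step about the key phrases, then give the final label.\n"
    "Review: 'The battery died after two days.'"
)

for name, prompt in [
    ("zero-shot", zero_shot),
    ("few-shot", few_shot),
    ("chain-of-thought", chain_of_thought),
]:
    print(f"--- {name} ---\n{prompt}\n")
```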

You can find more prompting techniques here.
Progressive Journey for LLM Experiments
Just as there is a progression strategy for prompting to improve results, there are various techniques for using, experimenting with, and hosting LLMs. The following diagram outlines a structured approach for embarking on a progressive journey through different LLM usage methods. The approaches below are ordered from simple to complex, taking ease of use, cost, and resource requirements into account.

1. Accessing Closed LLMs via External APIs
This is the most straightforward way to harness the power of LLMs. Closed LLMs can be integrated into applications with minimal effort and optimized costs. It is the simplest method for creating quick proofs of concept, or even for production use, to expedite application development.
There are various good-quality closed LLMs available from different providers. Azure OpenAI and OpenAI offer powerful LLMs, such as the GPT series (GPT-3.5-turbo and GPT-4). Anthropic’s Claude, Google’s PaLM, Cohere, AI21’s Jurassic, BloombergGPT, and others represent a handful of closed LLMs that are hosted by third parties and can be accessed via APIs. Pricing for most of these models is determined by the number of tokens in the input and in the generated output.
These APIs typically have different context window variants, so it’s crucial to consider these when designing a solution for a specific use case. For example, GPT-4 offers two context window variants: 8K and 32K tokens.
This option provides a cost-effective and less risky approach compared to other options. It also offers the flexibility to change the underlying model by simply updating the API implementation.
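For example, a minimal sketch of calling a closed LLM through the OpenAI Python SDK might look like the following; the model name, prompt, and parameters are placeholders and should be adapted to your provider and use case.

```python
# Minimal sketch: calling a closed, third-party-hosted LLM via its API.
# Assumes the `openai` Python package is installed and OPENAI_API_KEY is set
# in the environment; the model name and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # swap for gpt-4 or another provider's model
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the benefits of API-based LLM access in two sentences."},
    ],
    max_tokens=150,   # output tokens also count toward billing
    temperature=0.2,
)

print(response.choices[0].message.content)
```

Because the integration surface is essentially a single API call, switching to a different closed model usually only means changing the client and the model identifier.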
2. Self-Hosted Quantized Open LLMs
Quantization, in simple terms, involves compressing model weights to reduce the model’s size and memory footprint. It converts a 32/16-bit model into a lower precision, such as 8-bit or 4-bit, significantly reducing the model’s memory footprint. An 8-bit model, for example, requires four times less memory than its 32-bit counterpart, making it far more affordable to run. By reducing the model’s size, memory requirements, and computational demands without significantly sacrificing accuracy, quantized models become accessible to users who may not have access to high-end GPUs such as the Nvidia A100.
Quantized models can be loaded and executed on consumer-grade CPUs or GPUs. Many quantized models are available out-of-the-box and can be deployed behind an inference endpoint on regular-grade hardware.
A few examples include GPT4All, GPTQ, Ollama, Hugging Face, and more, which offer quantized models available for direct download and use in inference or for setting up inference endpoints.
This option is cost- and resource-efficient, as pre-quantized models have minimal CPU/GPU and RAM requirements for self-hosting an inference endpoint. However, achieving optimal inference-endpoint performance still requires careful design and resource planning.
Note: You must evaluate these quantized models for accuracy for the given use cases before any production usage. Also, only certain types of hardware may support these models and quantized techniques, so you should verify the hardware requirements before utilizing these models.
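As a minimal sketch, assuming a pre-quantized GGUF model file has already been downloaded locally and the llama-cpp-python package is installed, loading and querying it on consumer-grade hardware could look like this; the model path and generation parameters are illustrative.

```python
# Minimal sketch: running a pre-quantized (GGUF) open LLM on commodity hardware.
# Assumes `pip install llama-cpp-python` and a quantized model file already
# downloaded locally; the file name below is a hypothetical placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # 4-bit quantized weights
    n_ctx=2048,    # context window for this session
    n_threads=8,   # CPU threads to use for inference
)

output = llm(
    "Q: What is quantization in the context of LLMs? A:",
    max_tokens=128,
    stop=["Q:"],
    temperature=0.2,
)

print(output["choices"][0]["text"].strip())
```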
3. Cloud-Hosted Open LLMs
Open LLMs are large and demand intensive hardware (CPU/GPU/high RAM). The RAM required to set up an inference endpoint varies with the number of parameters in the model (e.g., 7 billion, 13 billion, 65 billion): the bigger the model, the higher the GPU/RAM requirement.
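A rough back-of-the-envelope estimate (weights only, ignoring activations, KV cache, and serving overhead) helps illustrate why parameter count drives the hardware bill:

```python
# Rough estimate of weight memory only: parameters x bytes per parameter.
# Real deployments need extra headroom for activations, KV cache, and serving.
def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    return num_params_billion * 1e9 * bytes_per_param / 1e9  # fp16 = 2 bytes/param

for size in (7, 13, 65):
    print(f"{size}B params @ fp16 ≈ {weight_memory_gb(size):.0f} GB of GPU/CPU memory")
# 7B ≈ 14 GB, 13B ≈ 26 GB, 65B ≈ 130 GB before any overhead
```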
You can set up these models either on-premises or on-cloud. Most cloud providers, such as Azure, AWS, and GCP, offer infrastructure and libraries for open LLMs, enabling quick setup of an inference endpoint.
Models can be set up on-premises, too, but would require a higher initial investment to procure hardware. We recommend opting for a cloud-based approach for quickly deploying open LLMs without a significant upfront cost, especially considering the dynamic nature of this field and ongoing research to make these models accessible on consumer-grade hardware.
Examples of cloud-hosted open LLM models include Llama2, Falcon, Dolly, Flan-T5, Vicuna, etc.
This option is the most expensive, as setting up an inference endpoint for foundational LLM models with many parameters demands substantial GPU-based hardware and increased RAM. Meeting low-latency service level agreements and high availability requirements would necessitate additional infrastructure, making the solution costly. Therefore, it is crucial to evaluate the costs against the business value.
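As a minimal sketch, assuming a cloud GPU instance with the Hugging Face transformers library installed, loading an open model and generating text could look like the following; in practice you would put this behind a serving layer (for example FastAPI, Text Generation Inference, or a managed endpoint from your cloud provider). The model id is one of the public checkpoints mentioned above and is only illustrative.

```python
# Minimal sketch: hosting an open LLM on a cloud GPU instance with transformers.
# Assumes `pip install transformers accelerate torch` and enough GPU memory for
# the chosen checkpoint; the model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # any open checkpoint you are licensed to use

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory
    device_map="auto",          # spread layers across available GPUs
)

inputs = tokenizer(
    "Explain the trade-offs of self-hosting an open LLM.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```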
4. Self-Hosted and Self-Quantized Open LLMs
This option involves using various quantization techniques and libraries to quantize LLM models independently. It focuses on post-training quantization approaches where quantization techniques are applied to pre-trained foundational models to reduce their size, memory, and computing power requirements. During the quantization process, it’s important to note that these models don’t undergo any further training.
You can quantize a model to 8-bit, 5-bit, or 4-bit precision, among other levels.
Here are a few techniques to quantize/compress the LLM base foundational models.
Here are a few libraries that can help quantize LLM-based foundational models.
Please note that Hugging Face’s Transformers library integrates Auto-GPTQ and Bitsandbytes. Also, understand that this option is complex and requires in-depth research skills to work with the various quantization techniques available. Unless self-quantization is truly necessary and you have a deep understanding of these techniques, it is best to use pre-quantized models.
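As a minimal sketch of the Bitsandbytes integration mentioned above, the following loads a pre-trained open model in 4-bit precision at load time (a form of post-training quantization); the model id and configuration values are illustrative, and other techniques such as GPTQ follow a different, calibration-based workflow.

```python
# Minimal sketch: post-training 4-bit quantization via the transformers +
# bitsandbytes integration. Assumes `pip install transformers accelerate bitsandbytes`
# and a CUDA-capable GPU; the model id and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-7b-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit data type
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16 for speed/accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "In one sentence, what does 4-bit quantization change about this model?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```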
Implementing this approach involves a substantial initial investment in skilled engineers, resources, and time. However, if these experiments yield promising outcomes and mature into well-established internal libraries, they can significantly reduce infrastructure costs later.
Conclusion
In this post, we have explored diverse options for a progressive journey of LLM-based experiments for organizations or projects. The options are arranged from simple to complex and offer a structured framework for making informed decisions for a given use case. This post has focused explicitly on using out-of-the-box models and strategies, without fine-tuning.
Originally published on Medium - https://medium.com/@genai_cybage_software/integrating-gen-ai-in-your-product-ecosystem-c6de8d136b03