LangChain Agents with Tools Tutorial: Error Handling and Safety Patterns
LangChain Agents with Tools Tutorial: Error Handling and Safety Patterns
LangChain agents are smart helpers that can think and use tools to get tasks done. Imagine giving your computer a brain and a toolbox; that’s kind of what LangChain agents are like. They can browse the internet, calculate numbers, or even talk to other programs.
But just like any helper, sometimes things don’t go as planned. Tools might fail, the internet could be down, or the agent might get confused. This is where error handling and safety patterns become super important.
This guide will show you how to build robust LangChain agents that can handle unexpected problems gracefully. You will learn how to make your agents reliable and safe to use. We will cover everything from simple retries to advanced security considerations.
Understanding LangChain Agents and Tools
Before we dive into errors, let’s quickly review what LangChain agents and their tools are. An agent uses a language model, like a super-smart chatbot, to decide what to do next. It looks at your request and thinks about the steps needed to complete it.
Part of its thinking involves choosing the right “tools” from its toolbox. These tools are like mini-programs or functions that the agent can call upon. For example, a tool could be a calculator, a web search function, or a way to send an email.
The agent picks a tool, gives it some input, and then waits for the tool to give back an answer. It uses this answer to continue its thinking process. This cycle of thinking, acting, and observing helps the agent achieve its goal.
The Problem: Why Things Go Wrong
Even the smartest agents and most useful tools can run into trouble. Imagine your agent trying to look up today’s weather, but the weather website is down. Or maybe the agent asks a tool to add “apple” and “banana,” which doesn’t make sense.
These unexpected situations are called tool execution errors. They can stop your agent in its tracks, making it seem broken or unhelpful. Sometimes, the error might even give back confusing or wrong information.
We need to teach our agents how to deal with these problems gracefully. This means planning for failures, rather than just hoping they won’t happen. By thinking ahead, we can make our agents much more robust and reliable.
Core Concepts of Error Handling
Handling errors well means having a plan for different kinds of problems. Let’s look at some key ideas that help us build resilient agents. These concepts are like different backup plans for when things go wrong.
Graceful Degradation
Graceful degradation means that if one part of your system fails, the whole thing doesn’t just crash. Instead, it tries to keep working, perhaps with less functionality or by using a simpler method. For an agent, this might mean if a specific tool isn’t working, the agent tries to solve the problem with other tools or provides a less detailed answer.
Imagine a smart home system where the fancy weather sensor breaks. Instead of saying “I can’t tell you the weather,” it might check a basic online weather report. It’s still providing a service, even if it’s not the best possible one. This makes the user experience much smoother.
You want your LangChain agent to always try its best to deliver something useful. Even if it can’t complete the primary task, it should offer a reasonable fallback or explain the issue clearly. This prevents frustration and keeps your agent helpful.
Retry Mechanisms
Sometimes, a tool might fail just because of a temporary hiccup, like a brief internet connection drop. A retry mechanism means your agent tries the same action again after a short wait. It’s like knocking on a door again if no one answers the first time.
You usually don’t want to retry endlessly, though. There should be a limit to how many times an agent tries. Often, we use something called “exponential backoff,” which means waiting a little longer with each retry. This prevents overloading a struggling service.
Retries are excellent for intermittent problems that might fix themselves quickly. They make your agent more patient and persistent. You can often find pre-built solutions or design your own simple retry logic for your tools.
Timeout Handling
Have you ever waited forever for a webpage to load? That’s what happens when there’s no timeout handling. A tool might get stuck, waiting for a response that never comes. This can make your agent freeze and never finish its task.
Timeout handling sets a maximum amount of time a tool should wait for a response. If the tool doesn’t respond within that time, the agent stops waiting and considers it a failure. This prevents your agent from getting stuck indefinitely.
Setting sensible timeouts is crucial for keeping your agent responsive. It ensures that even if an external service is completely unresponsive, your agent can move on. This allows your agent to either retry, use a fallback, or report the problem.
Fallback Strategies
A fallback strategy is your agent’s backup plan. If an action fails even after retries and timeouts, what should the agent do next? This is where your fallback comes into play. It’s like having a plan B when plan A doesn’t work.
For example, if a tool meant to fetch real-time stock prices fails repeatedly, a fallback might be to provide the last known price. Another fallback could be to tell the user directly that the tool is unavailable. The simplest fallback is often to just return a helpful error message.
Having good fallback strategies ensures that your agent can always respond, even if it’s just to say “I’m sorry, I can’t do that right now.” This is much better than staying silent or crashing. You can even design a fallback tool that provides a default answer.
Implementing Error Handling in LangChain Agents
Now, let’s see how we can put these concepts into practice with LangChain agents. We’ll look at code examples to make these ideas concrete. You’ll learn how to wrap your tools with protective layers.
Catching Tool Errors
The most basic way to handle errors in Python (and thus in your LangChain tools) is using try-except blocks. This allows you to “try” to run some code and “catch” any errors that happen.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from langchain.tools import Tool
import requests # A common library for making web requests
def get_weather_data(city: str) -> str:
"""Fetches current weather data for a given city."""
try:
# Example of a tool that might fail (e.g., network error, API key invalid)
api_key = "YOUR_WEATHER_API_KEY" # Replace with your actual API key
url = f"http://api.weatherapi.com/v1/current.json?key={api_key}&q={city}"
response = requests.get(url, timeout=5) # Added timeout here for safety
response.raise_for_status() # Raises an HTTPError for bad responses (4xx or 5xx)
data = response.json()
temp_c = data['current']['temp_c']
condition = data['current']['condition']['text']
return f"The current temperature in {city} is {temp_c}°C with {condition}."
except requests.exceptions.Timeout:
return f"Weather tool timed out while fetching data for {city}. Please try again later."
except requests.exceptions.RequestException as e:
# Catch any other request-related errors (network, connection, etc.)
return f"Could not fetch weather data for {city} due to a connection error: {e}"
except KeyError:
# If the API response structure changes or is unexpected
return f"Could not parse weather data for {city}. Unexpected API response."
except Exception as e:
# Catch any other unexpected errors
return f"An unexpected error occurred while fetching weather for {city}: {e}"
# Create a LangChain tool from the function
weather_tool = Tool(
name="WeatherFetcher",
func=get_weather_data,
description="Useful for fetching current weather information for a specified city. Input should be a city name.",
)
# Example of how an agent might use it
# If get_weather_data fails, it will return an error string instead of crashing.
print(weather_tool.run("London"))
print(weather_tool.run("InvalidCity"))
In this example, the try block attempts to get weather data. If anything goes wrong, an except block catches the specific error. You can see different except blocks for different types of errors, like a network issue (requests.exceptions.RequestException) or a problem with the data itself (KeyError). This way, you can give specific feedback about what went wrong.
Retrying Failed Tools
For errors that might be temporary, implementing retries is a smart move. You can write your own retry logic or use a library that handles it for you. The tenacity library is excellent for this in Python.
First, you might need to install it: pip install tenacity.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
from langchain.tools import Tool
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
# Define which exceptions should trigger a retry
retryable_exceptions = (requests.exceptions.ConnectionError, requests.exceptions.Timeout)
@retry(
stop=stop_after_attempt(3), # Try up to 3 times
wait=wait_exponential(multiplier=1, min=2, max=10), # Wait 2s, 4s, 8s, etc.
retry=retry_if_exception_type(retryable_exceptions),
reraise=True # Re-raise the last exception if all retries fail
)
def _fetch_data_with_retry(url: str, api_key: str, city: str):
"""Helper function to fetch data that can be retried."""
print(f"Attempting to fetch weather for {city}...")
full_url = f"http://api.weatherapi.com/v1/current.json?key={api_key}&q={city}"
response = requests.get(full_url, timeout=5)
response.raise_for_status() # This will raise an HTTPError for bad status codes
return response.json()
def get_weather_data_with_retry(city: str) -> str:
"""Fetches current weather data for a given city with retries."""
api_key = "YOUR_WEATHER_API_KEY" # Replace with your actual API key
try:
data = _fetch_data_with_retry(url="", api_key=api_key, city=city) # URL is constructed inside helper
temp_c = data['current']['temp_c']
condition = data['current']['condition']['text']
return f"The current temperature in {city} is {temp_c}°C with {condition}."
except retryable_exceptions as e:
return f"Weather tool failed after multiple retries for {city} due to connection issue: {e}"
except requests.exceptions.HTTPError as e:
# Handle non-retryable errors like 404 (city not found) or 401 (bad API key)
if e.response.status_code == 400:
return f"Could not find weather data for '{city}'. Please check the city name."
return f"Weather tool failed for {city} due to an API error ({e.response.status_code}): {e.response.text}"
except KeyError:
return f"Could not parse weather data for {city}. Unexpected API response."
except Exception as e:
return f"An unexpected error occurred while fetching weather for {city}: {e}"
# Create a LangChain tool from the function
weather_tool_retried = Tool(
name="WeatherFetcherWithRetry",
func=get_weather_data_with_retry,
description="Useful for fetching current weather information for a specified city, includes retries for temporary issues. Input should be a city name.",
)
# Example usage (you might need to temporarily disable network or use a bad API key to see retries in action)
print(weather_tool_retried.run("Paris"))
print(weather_tool_retried.run("NonExistentCity"))
Here, the @retry decorator from tenacity is added to our _fetch_data_with_retry helper function. It tells Python to retry if specific network errors happen. It will try up to 3 times, waiting longer each time, before finally giving up and raising the error. This helps with graceful degradation as the agent tries harder before giving up.
Setting Timeouts for Tools
You already saw a timeout=5 parameter in our requests.get call. This is a very common way to implement timeout handling for web requests. For other types of operations or custom tools, you might need different approaches.
If your tool involves running external commands, you can often pass a timeout argument.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import subprocess
from langchain.tools import Tool
def run_long_command(command_args: list) -> str:
"""Runs a command and limits its execution time."""
try:
# Timeout is set to 10 seconds.
# If the command takes longer, a TimeoutExpired error is raised.
result = subprocess.run(command_args, capture_output=True, text=True, check=True, timeout=10)
return result.stdout
except subprocess.TimeoutExpired:
return f"Command '{' '.join(command_args)}' timed out after 10 seconds."
except subprocess.CalledProcessError as e:
return f"Command failed with error: {e.stderr}"
except Exception as e:
return f"An unexpected error occurred: {e}"
long_command_tool = Tool(
name="LongCommandRunner",
func=run_long_command,
description="Executes a system command and stops if it runs too long. Input should be a list of strings for the command and its arguments, e.g., ['sleep', '15'] or ['ls', '-l'].",
)
# Example usage
print(long_command_tool.run(['ls', '-l'])) # Should work fine
print(long_command_tool.run(['sleep', '5'])) # Will wait for 5 seconds
print(long_command_tool.run(['sleep', '15'])) # This will trigger the timeout
By adding timeout to subprocess.run, we ensure that our agent won’t get stuck waiting forever. This is vital for tools that might interact with unreliable external processes or execute long-running computations. You are effectively safeguarding your agent’s responsiveness.
Implementing Fallback Tools/Responses
When all else fails (retries don’t work, timeouts are hit), a fallback strategy provides a default answer or action. This prevents the agent from just stopping or giving a confusing error.
You can create a simple fallback tool that always returns a predefined message. The agent can then be configured to use this tool if others fail.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from langchain.tools import Tool
def default_fallback_response(query: str = "") -> str:
"""A simple fallback tool that provides a helpful default message."""
return "I'm sorry, I couldn't complete your request using my tools right now. Please try rephrasing or ask me something else."
fallback_tool = Tool(
name="DefaultFallback",
func=default_fallback_response,
description="Provides a generic helpful response when other tools fail.",
)
# Example: How an agent *might* integrate a fallback (conceptual, agent orchestrates this)
# If weather_tool_retried fails for "NonExistentCity", the agent could then choose to use fallback_tool.
print(fallback_tool.run("Get weather for Mars"))
Within a LangChain agent setup, you would typically manage this at the agent’s logic level. You might have your agent catch tool exceptions and then, in its internal _handle_tool_error or similar method, decide to use the DefaultFallback tool. This ensures the agent always provides a coherent response.
Safety Patterns for LangChain Agents
Beyond just handling errors, we must ensure our agents are safe. This means protecting them from bad inputs, securing their interactions, and preventing misuse. These safety patterns are crucial for building agents that are reliable and trustworthy.
Input Validation
One of the most critical safety patterns is input validation. This means checking any information you give to your tools or agent to make sure it’s correct, expected, and safe. Imagine a calculator tool: if you give it “hello” instead of a number, it can’t work. More importantly, bad inputs can sometimes be malicious.
Malicious input prevention is about stopping harmful commands or data from reaching your tools. For example, if your tool interacts with a database, a malicious user might try to use “SQL injection” to steal or delete data. If your tool uses file paths, someone might try to access files they shouldn’t.
You should always assume that any input your agent receives, especially from users, could be trying to cause harm. Think of it like checking ingredients before cooking: you only want good, safe items.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
from langchain.tools import Tool
import re # For regular expressions, useful for pattern matching
def safe_calculator(expression: str) -> str:
"""A calculator tool with basic input validation."""
# Only allow numbers, basic math symbols, and spaces
# This prevents running arbitrary code or complex operations.
if not re.fullmatch(r"^[0-9+\-*/().\s]+$", expression):
return "Invalid input. Only numbers and basic math operations (+-*/) are allowed."
try:
# Use a safe evaluation method, like ast.literal_eval or a custom parser,
# rather than eval() which can be dangerous.
# For simplicity, we'll use eval() with extreme caution, but in real apps, AVOID eval().
# A better approach would be to parse the expression into an Abstract Syntax Tree (AST)
# and then evaluate only allowed operations.
# Example of a safer (though more complex) approach:
# import ast
# expr_node = ast.parse(expression, mode='eval')
# allowed_nodes = (ast.Expression, ast.BinOp, ast.Num, ast.Add, ast.Sub, ast.Mult, ast.Div)
# for node in ast.walk(expr_node):
# if not isinstance(node, allowed_nodes):
# return "Forbidden operation detected in expression."
# result = eval(expression) # Still risky, but illustrative.
# For this example, let's just stick to the regex and a safer (but not foolproof) approach
# A production-grade calculator would use a dedicated math library or AST parsing.
# For example, using `numexpr` or `sympy.parsing.mathematica.parse_expr`
# Since eval is dangerous, let's simulate a calculation after validation for simplicity
# For actual calculation, use a dedicated, safe math library!
return f"Result of {expression} is (simulated result for safety)."
except ZeroDivisionError:
return "Cannot divide by zero."
except Exception:
return "Error in calculation. Please check your expression."
calculator_tool_validated = Tool(
name="ValidatedCalculator",
func=safe_calculator,
description="A simple calculator that only allows basic arithmetic and validates input to prevent malicious commands. Input should be a mathematical expression.",
)
print(calculator_tool_validated.run("2+2"))
print(calculator_tool_validated.run("import os; os.system('rm -rf /')")) # Malicious input
print(calculator_tool_validated.run("10/0"))
In this example, the safe_calculator tool uses re.fullmatch to ensure the input expression only contains numbers and basic math symbols. This is a simple form of malicious input prevention. It directly rejects anything that looks suspicious. For more complex validation, especially for structured data, you can use powerful libraries like Pydantic. Pydantic is a great input validation library that lets you define exactly what your data should look like.
For more advanced security, you might consider getting a security best practices guide to understand common attack vectors and defenses.
Output Sanitization
Just as important as validating inputs is output sanitization. This means cleaning up any data that comes out of your tools before your agent or a user sees it. Sometimes, a tool might return messy, unformatted, or even potentially harmful information.
For instance, a web scraping tool might grab some JavaScript code along with the text it’s supposed to get. If your agent then shows this raw output to a user on a webpage, it could lead to a “Cross-Site Scripting” (XSS) attack. Sanitizing means removing anything that isn’t plain, safe text.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
from langchain.tools import Tool
from html import escape # Used to escape HTML characters
def sanitize_html_output(raw_html_content: str) -> str:
"""Sanitizes HTML content to prevent XSS attacks."""
# This is a very basic sanitization. For production, use a dedicated library
# like bleach or DOMPurify (for JS environments).
# Here, we're simply escaping HTML special characters.
escaped_content = escape(raw_html_content)
# You might also want to remove specific tags or attributes if they were permitted
# and now need to be stripped for a specific context.
# For a complete solution, you would typically specify allowed tags and attributes.
return escaped_content
def process_web_content(url: str) -> str:
"""Simulates fetching and sanitizing web content."""
# In a real scenario, this would involve a web request
# For demonstration, we'll use a sample malicious HTML string
malicious_html = f"<p>Hello, this is content from {url}.</p><script>alert('You have been hacked!');</script>"
# Sanitize the output before returning it
safe_content = sanitize_html_output(malicious_html)
return safe_content
web_content_processor_tool = Tool(
name="WebContentProcessor",
func=process_web_content,
description="Fetches web content and sanitizes it to prevent security issues. Input should be a URL.",
)
print(web_content_processor_tool.run("https://httpbin.org/html"))
In this snippet, html.escape turns characters like < into <, making them harmless text instead of active HTML code. This output sanitization prevents malicious code from being executed by a browser or misused by other systems. Always clean up data from unknown sources.
Rate Limit Handling
Many online services and APIs have rate limits. These limits are put in place to prevent people from making too many requests too quickly, which could overload the service. If your agent hits a rate limit, its tools will start failing.
Your agent needs to be smart about these limits. Instead of just retrying immediately (which will just hit the limit again), it should wait. This often involves slowing down or pausing its requests.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
from langchain.tools import Tool
import time
import random
import requests
# A simple class to simulate an API that has rate limits
class MockRateLimitedAPI:
def __init__(self, requests_per_minute=5):
self.requests_per_minute = requests_per_minute
self.request_times = []
def _check_rate_limit(self):
now = time.time()
# Remove requests older than 1 minute
self.request_times = [t for t in self.request_times if now - t < 60]
if len(self.request_times) >= self.requests_per_minute:
# Simulate a 429 Too Many Requests error
raise requests.exceptions.HTTPError("429 Client Error: Too Many Requests for url: mock-api", response=requests.Response())
self.request_times.append(now)
def fetch_data(self, item: str) -> str:
self._check_rate_limit()
time.sleep(random.uniform(0.1, 0.5)) # Simulate some processing time
return f"Data for {item} (fetched at {time.ctime()})"
mock_api = MockRateLimitedAPI(requests_per_minute=3) # Very strict limit for testing
def get_item_data_with_rate_limit_handling(item: str) -> str:
"""Fetches data from a rate-limited API with handling."""
max_retries = 5
for attempt in range(max_retries):
try:
return mock_api.fetch_data(item)
except requests.exceptions.HTTPError as e:
if "429 Client Error" in str(e):
wait_time = 2 ** attempt + random.uniform(0, 1) # Exponential backoff with jitter
print(f"Rate limit hit! Waiting {wait_time:.2f} seconds before retrying... (Attempt {attempt + 1})")
time.sleep(wait_time)
else:
return f"API error for {item}: {e}"
except Exception as e:
return f"An unexpected error occurred for {item}: {e}"
return f"Failed to fetch data for {item} after {max_retries} attempts due to rate limits."
rate_limited_tool = Tool(
name="RateLimitedDataFetcher",
func=get_item_data_with_rate_limit_handling,
description="Fetches specific item data from a rate-limited service, handles 429 errors gracefully. Input should be an item name.",
)
# Simulate hitting rate limits
for i in range(10):
print(rate_limited_tool.run(f"Item-{i+1}"))
This example shows a tool that tries to fetch data, but if it gets a “Too Many Requests” (429) error, it waits. It uses an exponential backoff strategy, meaning it waits longer with each failed attempt. This prevents the tool from repeatedly hitting the rate limit. You can also use rate limiting services or libraries to help manage this across many tools and agents.
Security Considerations
Beyond specific input/output issues, there are broader security considerations when building agents. These involve how your agent operates and what access it has.
- Least Privilege Principle: Your agent and its tools should only have the minimum permissions they need to do their job. If a tool only needs to read a file, it shouldn’t have permission to delete files. This minimizes the damage if a tool is compromised.
- Secure API Key Management: API keys and other secrets should never be hardcoded directly into your code. Use environment variables, secure vaults, or secret management services. Treat them like your house keys.
- Regular Security Audits: Regularly check your agent’s code and its dependencies for vulnerabilities. This is where security scanning tools can be very helpful. These tools automatically look for known weaknesses in your code.
- Penetration Testing: For critical applications, consider penetration testing tools or services. Penetration testing involves simulating cyberattacks to find weaknesses before real attackers do.
- Monitoring and Alerting: Keep an eye on your agent’s activity. If something unusual happens (like many failed tool calls or access attempts to sensitive areas), you should be alerted immediately.
These practices ensure your agent isn’t just functional but also resistant to attacks and misuse. Building with security in mind from the start is much easier than trying to fix it later.
Advanced Safety Patterns
As your LangChain agents become more complex and handle more important tasks, you’ll want to employ even more robust safety measures. These advanced patterns provide additional layers of protection and control.
Human-in-the-Loop
Sometimes, a task is too sensitive or complex for an agent to handle entirely on its own. This is where a human-in-the-loop approach comes in. It means the agent knows when to stop and ask for a human’s help or approval.
For example, if an agent is asked to perform a financial transaction, it might confirm the details with a user before proceeding. Or, if it encounters an error it can’t resolve, instead of giving a generic message, it could escalate the issue to a human operator.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
from langchain.tools import Tool
def human_approval_tool(action_description: str) -> str:
"""A tool that requires human approval before proceeding."""
print(f"\n--- Human Intervention Required ---")
print(f"The agent wants to perform the following action: '{action_description}'")
# In a real application, this would involve sending a notification
# or displaying a UI prompt to a human user.
response = input("Do you approve this action? (yes/no): ").lower().strip()
if response == "yes":
print("Human approved. Proceeding...")
return "Action approved by human."
else:
print("Human denied. Aborting action.")
return "Action denied by human."
human_approval_agent_tool = Tool(
name="HumanApproval",
func=human_approval_tool,
description="Requires human approval for sensitive or critical actions. Input should be a description of the action.",
)
# Example usage within an agent's thought process
# An agent might decide to use this tool if its confidence is low
# or if the action is marked as high-risk.
print(human_approval_agent_tool.run("send an email to all clients saying 'hello'"))
print(human_approval_agent_tool.run("delete all marketing campaign data"))
This HumanApproval tool demonstrates how an agent can pause its work and ask for input. This is critical for preventing mistakes in high-stakes situations and helps build trust in your agent. You want to make sure your agent doesn’t do anything irreversible without proper oversight.
Monitoring and Alerting
Even with the best error handling and safety patterns, things can still go wrong in production. That’s why monitoring and alerting are essential. Monitoring means constantly watching your agent’s performance, resource usage, and error rates. Alerting means automatically notifying someone when critical issues are detected.
You should log all significant events: when tools are called, what inputs they receive, what outputs they return, and any errors that occur. These logs are invaluable for debugging and understanding your agent’s behavior.
When an error happens, especially one that wasn’t gracefully handled, an alert should be triggered. This could be an email, a Slack message, or a notification to a paging system. You want to know immediately if your agent stops working or behaves unexpectedly. This allows for quick detection and resolution of problems, minimizing downtime and negative impact.
Production Safety Frameworks
For large-scale, critical LangChain agent deployments, you might adopt production safety frameworks. These are established sets of practices, tools, and processes designed to ensure systems are reliable, secure, and performant in real-world use.
Concepts from Site Reliability Engineering (SRE) are highly relevant here. SRE focuses on applying software engineering principles to operations to create highly reliable and scalable systems. This includes error budgets, incident management, and post-mortems.
Adopting such frameworks helps you move beyond ad-hoc error handling to a systematic approach to operational safety. It makes your agent part of a well-managed, resilient ecosystem. You can learn more about these practices through reliability engineering courses or by exploring specific production safety frameworks.
Putting It All Together: A Comprehensive Example
Let’s imagine an agent designed to help with e-commerce support. It needs to fetch order details, calculate shipping, and maybe even look up product information. This agent would need robust error handling and safety.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
import requests
import time
import random
import json
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from langchain.tools import Tool
# --- Affiliate Link Placeholders for context ---
# [Error Handling Best Practices Course](https://www.coursera.org/browse/computer-science/software-development)
# [Secure Input Validation Library](https://pydantic-docs.helpmanual.io/)
# [API Rate Limiting Service](https://redis.io/docs/data-types/probabilistic/count-min-sketch/)
# [Cybersecurity Best Practices Guide](https://owasp.org/)
# --- End Affiliate Link Placeholders ---
# 1. Mock External Services (with potential for errors/delays)
class MockOrderAPI:
def get_order_details(self, order_id: str):
if not order_id.isdigit():
raise ValueError("Order ID must be a number.")
order_id_int = int(order_id)
if order_id_int % 3 == 0: # Simulate a temporary network error
print(f"Simulating network error for order {order_id}...")
raise requests.exceptions.ConnectionError("Simulated network issue.")
if order_id_int % 5 == 0: # Simulate a 404 Not Found
print(f"Simulating 404 for order {order_id}...")
response = requests.Response()
response.status_code = 404
raise requests.exceptions.HTTPError(f"404 Not Found: Order {order_id} does not exist.", response=response)
time.sleep(random.uniform(0.1, 0.3)) # Simulate latency
return {"order_id": order_id, "customer": "John Doe", "items": ["Laptop", "Mouse"], "status": "Shipped"}
class MockShippingAPI:
def calculate_shipping(self, destination: str, weight_kg: float):
if not isinstance(destination, str) or not isinstance(weight_kg, (int, float)):
raise ValueError("Invalid input for shipping calculation.")
if weight_kg > 100: # Simulate a heavy package limit
raise ValueError("Package too heavy for standard shipping.")
time.sleep(random.uniform(0.2, 0.5))
return {"destination": destination, "cost": weight_kg * 5.0 + 10.0}
class MockProductAPI:
# This API will have a strict rate limit
def __init__(self, requests_per_minute=2):
self.requests_per_minute = requests_per_minute
self.request_times = []
def _check_rate_limit(self):
now = time.time()
self.request_times = [t for t in self.request_times if now - t < 60]
if len(self.request_times) >= self.requests_per_minute:
response = requests.Response()
response.status_code = 429
raise requests.exceptions.HTTPError("429 Client Error: Too Many Requests", response=response)
self.request_times.append(now)
def get_product_info(self, product_id: str):
self._check_rate_limit()
if not product_id.isalnum():
raise ValueError("Product ID must be alphanumeric.")
time.sleep(random.uniform(0.1, 0.4))
if product_id == "LAPTOP123":
return {"product_id": product_id, "name": "Super Laptop X", "price": 1200.00, "weight_kg": 2.5}
elif product_id == "MOUSEPRO":
return {"product_id": product_id, "name": "Pro Gaming Mouse", "price": 75.00, "weight_kg": 0.2}
else:
return None # Product not found
order_api = MockOrderAPI()
shipping_api = MockShippingAPI()
product_api = MockProductAPI(requests_per_minute=2)
# 2. Define Retryable Exceptions for tenacity
RETRYABLE_API_ERRORS = (requests.exceptions.ConnectionError, requests.exceptions.Timeout)
# 3. Enhanced Tool Definitions with Error Handling, Retries, Timeouts, Validation, and Fallbacks
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=5), reraise=True,
retry=retry_if_exception_type(RETRYABLE_API_ERRORS))
def _get_order_details_retried(order_id: str):
return order_api.get_order_details(order_id)
def get_customer_order_details(order_id: str) -> str:
"""
Fetches customer order details with retries for network issues and specific error handling.
Input should be a numerical order ID string.
"""
# Input Validation
if not order_id.isdigit():
return "Invalid Order ID. Please provide a numerical order ID."
try:
details = _get_order_details_retried(order_id)
# Output Sanitization (for JSON output, pretty printing is a form of sanitization too)
return json.dumps(details, indent=2)
except RETRYABLE_API_ERRORS:
return f"Could not retrieve order {order_id} due to network issues after multiple retries. Please try again later."
except requests.exceptions.HTTPError as e:
if e.response.status_code == 404:
return f"Order {order_id} not found. Please check the ID."
return f"An API error occurred for order {order_id}: {e}"
except ValueError as e:
return f"Error with order ID '{order_id}': {e}"
except Exception as e:
return f"An unexpected error occurred while fetching order {order_id}: {e}"
order_details_tool = Tool(
name="GetOrderDetails",
func=get_customer_order_details,
description="Fetches customer order details by order ID, handles network errors and missing orders. Input: order_id (string).",
)
def calculate_shipping_cost(destination: str, order_id: str) -> str:
"""
Calculates shipping cost for an order to a destination. Includes input validation and fallback.
Input should be a JSON string like '{"destination": "New York", "order_id": "123"}'
"""
try:
args = json.loads(destination)
dest = args.get("destination")
order_id_for_weight = args.get("order_id")
if not dest or not isinstance(dest, str):
raise ValueError("Missing or invalid 'destination' in input.")
if not order_id_for_weight or not isinstance(order_id_for_weight, str) or not order_id_for_weight.isdigit():
raise ValueError("Missing or invalid 'order_id' in input for weight calculation.")
# Try to get order details to determine weight, with error handling
order_info = order_api.get_order_details(order_id_for_weight) # Using internal mock_api for weight
# This part assumes a simplified weight calculation for demo.
# In a real scenario, you'd fetch product details and sum up weights.
total_weight_kg = 5.0 # Simplified fixed weight for demonstration
shipping_cost_info = shipping_api.calculate_shipping(dest, total_weight_kg)
return f"Shipping cost to {shipping_cost_info['destination']} for order {order_id_for_weight} is ${shipping_cost_info['cost']:.2f}."
except ValueError as e:
return f"Invalid input or calculation issue for shipping: {e}"
except requests.exceptions.ConnectionError:
return "Shipping calculation unavailable due to network issues. Using a fallback flat rate of $15." # Fallback strategy
except Exception as e:
return f"An unexpected error occurred during shipping calculation: {e}"
shipping_calc_tool = Tool(
name="CalculateShippingCost",
func=calculate_shipping_cost,
description="Calculates shipping cost for an order to a destination. Input: JSON string with 'destination' and 'order_id'.",
)
# This tool demonstrates rate limit handling and output sanitization
def get_product_information(product_id: str) -> str:
"""
Retrieves product information, handles rate limits, and ensures safe output.
Input should be a product ID string.
"""
# Input Validation
if not product_id.isalnum():
return "Invalid Product ID. Must be alphanumeric."
max_retries = 3
for attempt in range(max_retries):
try:
info = product_api.get_product_info(product_id)
if info:
# Output Sanitization: Ensuring JSON output is safe and readable
return json.dumps(info, indent=2)
else:
return f"Product '{product_id}' not found."
except requests.exceptions.HTTPError as e:
if "429 Client Error" in str(e):
wait_time = 2 ** attempt + random.uniform(0.1, 0.5) # Exponential backoff with jitter
print(f"RATE LIMIT HIT (Product API). Waiting {wait_time:.2f}s... (Attempt {attempt + 1})")
time.sleep(wait_time)
else:
return f"API error for product {product_id}: {e}"
except ValueError as e:
return f"Product API validation error: {e}"
except Exception as e:
return f"An unexpected error occurred fetching product {product_id}: {e}"
return f"Failed to get product information for '{product_id}' after {max_retries} attempts due to rate limits."
product_info_tool = Tool(
name="GetProductInfo",
func=get_product_information,
description="Retrieves detailed product information, handles rate limits. Input: product_id (string).",
)
# --- Demonstrating the tools with various scenarios ---
print("--- Testing GetOrderDetails Tool ---")
print(order_details_tool.run("123")) # Good order
print(order_details_tool.run("abc")) # Invalid input
print(order_details_tool.run("3")) # Simulates network error, should retry and eventually fail if actual error occurs
print(order_details_tool.run("5")) # Simulates 404 error
print("\n" + "="*30 + "\n")
print("--- Testing CalculateShippingCost Tool ---")
print(shipping_calc_tool.run('{"destination": "London", "order_id": "123"}'))
print(shipping_calc_tool.run('{"destination": 123, "order_id": "123"}')) # Invalid destination type
print(shipping_calc_tool.run('{"destination": "Berlin", "order_id": "invalid"}')) # Invalid order_id for weight
print("\n" + "="*30 + "\n")
print("--- Testing GetProductInfo Tool (Rate Limited) ---")
for i in range(5): # This will trigger rate limits
print(product_info_tool.run("LAPTOP123"))
time.sleep(0.5) # A small pause to make logs clearer, but not enough to avoid rate limit
print(product_info_tool.run("INVALID!PRODUCT")) # Invalid input
print("\n" + "="*30 + "\n")
This comprehensive example brings together many of the patterns we discussed. The get_customer_order_details tool has input validation, retry mechanisms for network issues, and specific error messages for graceful degradation. The calculate_shipping_cost tool demonstrates more input validation and a fallback strategy when a network error prevents exact calculation. Finally, get_product_information showcases rate limit handling with exponential backoff and output sanitization by ensuring valid JSON.
By carefully crafting each tool with these protective layers, you build an agent that is not just smart but also reliable and safe, even when external systems are temperamental or inputs are unexpected. You are ensuring that your agent can handle tool execution errors and protect itself from malicious input prevention while providing graceful degradation.
Learning More
Building robust and secure LangChain agents is an ongoing process. There’s always more to learn about making your systems resilient and safe.
To deepen your understanding of error handling strategies, consider an error handling course. These courses often cover advanced topics like circuit breakers, bulkheads, and distributed tracing. For ensuring your systems can handle failures and recover quickly, exploring reliability engineering courses can provide invaluable insights into building highly available applications.
You can also refer to the LangChain documentation for the latest features and best practices directly from the source. Additionally, staying updated on general software development principles around security and resilience will always benefit your agent development.
Conclusion
You’ve learned that building powerful LangChain agents means more than just giving them smart tools. It also means preparing them for when things inevitably go wrong. By implementing error handling and safety patterns, you make your agents much more dependable.
Remember to validate inputs, sanitize outputs, and build in ways to retry, timeout, and fallback when tools fail. These practices protect your agent from tool execution errors and malicious input prevention, leading to more graceful degradation. With these safety measures in place, you can trust your LangChain agents to perform their tasks effectively and securely.
Leave a comment