
WhyLabs AI Observatory: The Data and ML Monitoring Platform
Monitoring multi-agent LLM workflows has become reliable and protects PII in real time
What is our primary use case?
My main use case for WhyLabs was for LLM monitoring and observability. At that time, I had an AI application that I deployed on Vertex AI, and I used WhyLabs for the observability, logging, and monitoring of that application and the model.
I can give a specific example of how I used WhyLabs to monitor my LLM application. It was a multi-agent system with around four agents, and each agent had around seven or eight tools that it could invoke. Whenever a user sent a query to the main agent, the main agent delegated the request to the sub-agents. The sub-agents could communicate with one another using the A2A protocol and call their tools. I monitored how the request progressed through the system. For instance, a user might send a request to one agent, which transferred it to a third agent; the third agent used a tool, and the request then went to the seventh agent. In WhyLabs I could easily monitor all of this communication between the agents: the logging time, the request, the response, any errors, and any guardrails I wanted in my application.
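To make the kind of record involved more concrete, here is a minimal sketch of a per-request trace across agents and tools. It is not the WhyLabs API; the `Hop` and `Trace` classes, the agent names, and the latency values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hop:
    """One step of a request: a delegation to an agent or a tool call."""
    source: str
    target: str
    kind: str              # "delegate" or "tool_call" (hypothetical labels)
    latency_ms: float
    error: Optional[str] = None


@dataclass
class Trace:
    """All hops recorded for a single user request."""
    request_id: str
    hops: List[Hop] = field(default_factory=list)

    def record(self, source, target, kind, latency_ms, error=None):
        self.hops.append(Hop(source, target, kind, latency_ms, error))

    def path(self):
        # Assumes the hops form one linear chain, as in the example below.
        if not self.hops:
            return []
        return [self.hops[0].source] + [h.target for h in self.hops]

    def total_latency_ms(self):
        return sum(h.latency_ms for h in self.hops)

    def errors(self):
        return [h for h in self.hops if h.error]


trace = Trace("req-1")
trace.record("main_agent", "agent_1", "delegate", 12.0)
trace.record("agent_1", "agent_3", "delegate", 8.5)
trace.record("agent_3", "search_tool", "tool_call", 140.0, error="timeout")

print(" -> ".join(trace.path()))
print(trace.total_latency_ms(), [h.target for h in trace.errors()])
```

With a record like this per request, the path, the time spent at each hop, and the failing step can be read off directly instead of being reconstructed from scattered logs.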
This was my only use case, and then WhyLabs got discontinued. WhyLabs was acquired by Apple in January or February 2025. The company then open-sourced their software so that anyone can use it. It is now open-source software available on GitHub where you can set it up yourself and use it.
What is most valuable?
WhyLabs's best features are real-time guardrails, PII (personally identifiable information) detection, hallucination mitigation, and monitoring. It has a centralized dashboard where I can create a project, see an overall summary, and check health metrics for WhyLabs itself or for the application on specific dates or at specific times. It also provides an alerting system: if there is an error or the system is down, it sends an alert via email.
Out of all those features, I find monitoring and PII detection most valuable in my day-to-day work, because an LLM application is very hard to monitor. As I mentioned earlier, it was a multi-agent system, and a query can move from one agent to another very easily, which made it hard to debug how the request was progressing and how the data was flowing. Monitoring, PII detection, and guardrails are the three features most useful to me. The guardrails and PII detection are particularly useful when I do not want PII passed to the agents or to any LLM.
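As a toy illustration of what a PII guardrail does before text reaches an agent or LLM, the sketch below redacts email addresses and phone-like numbers with regular expressions. This is an assumption-laden simplification, not how WhyLabs implements detection; the patterns, the `redact_pii` name, and the sample message are all hypothetical, and real detectors cover far more cases.

```python
import re

# Deliberately simple patterns for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact_pii(text):
    """Replace matches with a placeholder and report which kinds were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found


clean, labels = redact_pii(
    "Contact jane.doe@example.com or +1 415 555 0100 about the refund."
)
print(clean)
print(labels)
```

The list of detected kinds is what an alert would be built from, while the redacted text is what would be forwarded to the model.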
WhyLabs has positively impacted my organization by reducing the time we spend on errors and debugging, and it has improved the user experience. When the application is down, I receive alerts, which has saved my team a significant amount of time.
What needs improvement?
Regarding how WhyLabs can be improved, since it is not available in the market as of now, improvements cannot be made to the product itself. However, there is an open-source version that anyone can set up on their machine and try to accomplish the same things.
I do not think there is anything else needed for improvement.
For how long have I used the solution?
I was using it in 2024 for around 1.5 years.
What was our ROI?
WhyLabs has saved my team 30 to 40% of its time.
What other advice do I have?
Regarding WhyLabs's AI capabilities, I believe its governance and security are strong. It was deployed on our on-premises infrastructure, so all the data stayed within our infrastructure. The guardrails and the PII detection worked very well: I never saw a case where it failed to raise an alert for PII or where the guardrails did not work.
In terms of accuracy, I believe WhyLabs's AI capabilities are highly accurate. I used it for around 1.5 years, and it was the best software available before it was discontinued.
My advice to others considering WhyLabs is that as of now it is open-source, and you can set it up on your own machine for free and use it. It has very good features. I would rate this product a 10 out of 10.
I built a monitoring solution with WhyLabs for multiple ML models for a client of mine.
Model performance monitoring
WhyLabs helped us set up end-to-end monitoring of our ML projects
* The tool allows easy ingestion of a large number of features and setting up initial monitoring on them
* We can use it to monitor both input quality and model performance
* Alerts can be routed to specific groups of users via specific channels (email/Slack), which is helpful
* Some actions are not possible via UI and require specific API calls
* Documentation can be hard to navigate
Excellent tool for ML Monitoring with many out-of-the-box solutions
Reliable AI Monitoring with Some Complexity
Self-Serve Observability Platform
whylogs seemed like the perfect choice for a consultant whose clients did not want to fully release their data: I found that it captures only profile and summary statistics rather than the raw data.
Recently, I started testing LLM security features with LangKit, and I cannot believe how quick it is to use. A few months ago I followed a workshop that showed me how to detect jailbreak attempts and toxicity in LLM inputs and outputs using LangKit. I took that learning and, on a client's project, we have now tested logging the telemetry data from the evaluation to WhyLabs. It looks good so far, so once I upgrade the pricing limit for this client, we plan to scale our usage. Excited about this one.
Top notch features at an affordable price
I will evaluate some dimensions of the tool that summarize my experience with it.
Easy Data Ingestion:
The ingestion API is straightforward to use and supports multiple connectors such as BigQuery, Databricks, and Spark, which makes importing data easy. Because WhyLabs relies on data profiling, processing is fast and there is no need to upload entire datasets; the whole process is very secure, since your data doesn't leave your servers.
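The profiling idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the whylogs or WhyLabs API; the function names `profile_column` and `profile_rows` and the sample rows are invented here. The point is that only the small summary dictionary would need to be sent anywhere, never the rows themselves.

```python
from statistics import mean, pstdev


def profile_column(values):
    """Summarize one column; only these aggregates would leave the machine."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
        "stddev": pstdev(present) if present else None,
    }


def profile_rows(rows):
    """Build a per-column summary from a list of row dictionaries."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: profile_column(vals) for name, vals in columns.items()}


rows = [
    {"age": 34, "income": 52000.0},
    {"age": 41, "income": None},
    {"age": 29, "income": 61000.0},
]
profile = profile_rows(rows)
print(profile["age"])
print(profile["income"])
```

The summary stays the same size no matter how many rows are profiled, which is also why this approach is fast.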
Reliable Data Features:
WhyLabs delivers all standard feature metrics accurately. Tracking data and model drift is very straightforward using Monitors.
Also, the platform supports custom metrics creation during or after ingestion.
Grouping by variables (segments) works well but must be defined during ingestion. Then you can analyze dataset features and track model performance per segment.
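A small sketch may help show why segments have to be chosen at ingestion when only summaries are kept. It is a simplified stand-in, not the WhyLabs segmentation API; `profile_by_segment`, the `region` key, and the scores are hypothetical.

```python
from collections import defaultdict


def profile_by_segment(rows, segment_key, metric):
    """Aggregate `metric` per segment value while the raw rows are still available."""
    acc = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        seg = acc[row[segment_key]]
        seg["count"] += 1
        seg["total"] += row[metric]
    return {
        key: {"count": s["count"], "mean": s["total"] / s["count"]}
        for key, s in acc.items()
    }


rows = [
    {"region": "EU", "score": 0.9},
    {"region": "EU", "score": 0.7},
    {"region": "US", "score": 0.4},
]
segmented = profile_by_segment(rows, "region", "score")
print(segmented)
```

Once the rows are discarded and only these per-segment aggregates remain, there is no way to regroup them by a different variable, which matches the limitation described here.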
Flexible Monitors:
The monitoring system in Whylabs is highly adaptable and user-friendly, covering multiple variables with ease.
Monitors are easy to set up via the UI or JSON import, with summarized notifications for each monitor, keeping users informed without overwhelming them.
Additionally, monitors are JSON serializable, which is very helpful since you can track them with version control.
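For instance, a monitor definition can be written to a file in a stable form so that version-control diffs stay small. The sketch below assumes a made-up monitor structure; the field names (`id`, `metric`, `columns`, `threshold`, `notify`) are hypothetical and are not the actual WhyLabs monitor schema.

```python
import json

# Hypothetical monitor definition, for illustration only.
monitor = {
    "id": "feature-drift-daily",
    "metric": "drift",
    "columns": ["age", "income"],
    "threshold": 0.7,
    "notify": ["email"],
}


def to_versionable_json(config):
    """Serialize with sorted keys and fixed indentation so diffs are stable."""
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


text = to_versionable_json(monitor)
restored = json.loads(text)
print(text)
```

Sorting keys means that re-exporting an unchanged monitor produces byte-identical output, so a commit only shows the settings that were actually edited.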
User-Friendly Usability:
WhyLabs has a clean and intuitive UI, simplifying navigation for users.
While some advanced features may require programming knowledge, most tasks can be accomplished within the UI.
Thanks to data profiling, Whylabs delivers speedy performance without compromising on accuracy.
Solid Documentation:
The documentation provided by Whylabs is comprehensive and easy to understand, enabling users to make the most of the platform.
Pricing:
It's simply cheaper than its competition while having top notch features.
Customer Support:
They are always very helpful, answering all our questions and holding several calls to showcase different use cases directly on the platform.
Overall, WhyLabs offers a straightforward, efficient, and affordable solution for monitoring machine learning models, with easy data ingestion, reliable feature analysis, and flexible monitoring options.
- Dashboards are in beta and, while functional, their user interface lacks polish. The team is actively working on this, so it may already be fixed a few months after this review.
- Defining groupings by variables must be done at ingestion time, limiting flexibility for post-ingestion analysis.
That being said, they are very open to feedback and they may change or add features based on your needs.
In our case, dashboards were important and they are working on them.