Journalists and technologists from multiple news organizations (Hearst, Gannett, The Globe and Mail, and E24) collaborated to develop an AI system that automatically detects newsworthy real estate transactions. The system combines anomaly detection, LLM-based analysis, and human feedback to identify significant property transactions, with a particular focus on celebrity involvement and price anomalies. Early results showed promise with few-shot prompting, and the system successfully identified several newsworthy transactions that might otherwise have been missed by traditional reporting methods.
The Real Estate Alerter is a collaborative project developed as part of a journalism fellowship, bringing together journalists and technologists from multiple news organizations: Hearst and Gannett in the United States, The Globe and Mail in Canada, and E24 in Norway. The project aims to solve a fundamental challenge in real estate journalism: with thousands of property transactions occurring daily, how can newsrooms systematically identify the ones that are actually worth covering?
The origin story illustrates the problem well. A Detroit Free Press editor, Randy Essex, happened to spot a “For Sale” sign for a $2 million property near a Michigan train station while out running. He realized this was a significant story that no one had picked up. The resulting article became one of the most-read stories that week, prompting team member Annette to question whether there might be a more efficient approach than physically running around the city to find newsworthy transactions.
The Real Estate Alerter employs a multi-stage pipeline that combines traditional data science techniques with LLM-based reasoning. The architecture consists of several key components working together:
Anomaly Detection Layer: The first stage uses domain knowledge, statistical analysis, and clustering algorithms to identify outliers within the transaction dataset. This preprocessing step is crucial because it dramatically reduces the volume of data that needs to be processed by the more computationally expensive LLM stage. Rather than asking the LLM to evaluate every transaction, the system first filters down to transactions that are statistically unusual in some way.
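The statistical side of this filtering stage can be sketched with a simple z-score outlier test. This is a minimal illustration, not the team's actual implementation (which reportedly also uses domain rules and clustering); the field names are assumptions.

```python
from statistics import mean, stdev

def flag_outliers(transactions, key="price", z_threshold=3.0):
    """Keep only transactions whose value deviates strongly from the mean.

    A minimal z-score filter standing in for the real pipeline's mix of
    domain knowledge, statistics, and clustering. Only the survivors are
    passed on to the (more expensive) LLM stage.
    """
    values = [t[key] for t in transactions]
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [t for t in transactions if abs(t[key] - mu) / sigma > z_threshold]
```

In a daily batch of thousands of rows, a filter like this typically leaves only a handful of candidates for LLM assessment, which is the cost-saving point of the layer.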
Data Preprocessing and Enrichment: Structured transaction data from databases is transformed into natural language format to be more easily digestible by the LLM. The team also enriches this data with external contextual information, such as details about geographical areas mentioned in the database, what those areas are known for, and what would be considered unusual about them. This contextual enrichment helps the LLM make more informed decisions about newsworthiness.
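A minimal sketch of this step might render each database row as a sentence and append area context, roughly as the team describes. The schema, field names, and context lookup here are assumptions for illustration.

```python
def transaction_to_text(txn, area_context=None):
    """Render a structured transaction row as natural language for the LLM,
    optionally enriched with external context about the area."""
    text = (
        f"A {txn['property_type']} at {txn['address']} in {txn['area']} "
        f"sold for {txn['price']:,} NOK ({txn['price'] // txn['sqm']:,} NOK per square meter)."
    )
    if area_context and txn["area"] in area_context:
        text += f" Context: {area_context[txn['area']]}"
    return text
```

The enrichment text is where knowledge like "what this area is known for" and "what would be unusual here" gets injected, so the LLM is not reasoning from raw numbers alone.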
LLM-Based Newsworthiness Assessment: The core of the system relies on an LLM to determine which of the identified outliers are actually newsworthy. The team employs few-shot prompting to direct the LLM’s behavior, providing examples that illustrate what constitutes a newsworthy transaction. This is a critical design decision that acknowledges the LLM needs guidance—it cannot independently determine what journalists consider newsworthy.
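A few-shot prompt of the kind described could be assembled as below. The example transactions and labels are invented placeholders; the team's actual examples came from reporter interviews.

```python
# Hypothetical few-shot examples; the real ones were derived from
# interviews with real estate reporters.
FEW_SHOT_EXAMPLES = [
    ("A villa in Holmenkollen sold for 60,000,000 NOK, roughly four times "
     "the area's median price.", "NEWSWORTHY: extreme price outlier"),
    ("An apartment sold for 4,800,000 NOK, in line with the local market.",
     "NOT NEWSWORTHY: ordinary transaction"),
]

def build_prompt(transaction_text):
    """Assemble a few-shot prompt that shows the LLM what journalists
    consider newsworthy before asking about the new transaction."""
    lines = ["You are a real estate journalist. Decide whether each "
             "transaction is newsworthy and explain why."]
    for example, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Transaction: {example}\nAssessment: {label}")
    lines.append(f"Transaction: {transaction_text}\nAssessment:")
    return "\n\n".join(lines)
```

The point of the structure is exactly what the text notes: the examples, not the model's general world knowledge, define what "newsworthy" means for this newsroom.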
Famous Person Detection: A notable feature is the integration of named entity recognition on archive data to extract and maintain a sorted list of famous people. This list enriches the feature set used for newsworthiness detection. The team uses Wikidata to resolve challenges like matching stage names to legal names (e.g., querying “Lady Gaga” returns her legal name) and obtaining dates of birth to disambiguate common names (e.g., distinguishing a famous “James Smith” from many others with the same name). In Norway, they also leverage tax registry data for date of birth matching.
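The matching step can be sketched as a lookup keyed on legal name plus date of birth, which is how the DOB disambiguation described above would work. The table entries below are illustrative stand-ins for data that would come from archive NER plus Wikidata (and, in Norway, the tax registry).

```python
# Hypothetical famous-person table: (legal name, date of birth) -> known-as.
# Lady Gaga's legal name and DOB are real; the "James Smith" entry is invented.
FAMOUS = {
    ("stefani joanne angelina germanotta", "1986-03-28"): "Lady Gaga",
    ("james smith", "1970-01-15"): "James Smith (author)",
}

def match_famous(legal_name, dob):
    """Match a transaction party against the famous-person list.

    Transaction records carry legal names, so stage names must already be
    resolved (via Wikidata) on the list side; the date of birth separates
    a famous 'James Smith' from the many others with the same name.
    """
    return FAMOUS.get((legal_name.strip().lower(), dob))
```

A hit enriches the feature set passed to the newsworthiness stage; a miss on DOB correctly drops the many non-famous namesakes.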
A significant aspect of the LLMOps approach is the embedded human feedback mechanism. The team explicitly built the system to incorporate human judgment for several purposes:
False Positive Handling: When the LLM flags a transaction as newsworthy but journalists determine it is not (such as the example of a property transaction flagged because “Christian Eriksen” appeared—the famous Danish footballer’s name also happens to be a common Norwegian name), users can provide feedback through the dashboard. This feedback is collected for later fine-tuning of the system.
False Negative Review: Recognizing that LLMs might miss genuinely newsworthy transactions, the team built a filter for reviewing transactions deemed “non-newsworthy” by the system. Journalists can quickly scan these to catch any stories that were incorrectly filtered out.
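Both feedback paths can be sketched as below: a log of journalist corrections kept for later fine-tuning, and a review queue of everything the system rejected. The record schema and field names are assumptions, not the team's actual format.

```python
import time

def record_feedback(log, txn_id, llm_verdict, journalist_verdict, note=""):
    """Append a journalist's correction (e.g. a false positive like the
    'Christian Eriksen' namesake) to a log kept for later fine-tuning."""
    entry = {
        "transaction_id": txn_id,
        "llm_verdict": llm_verdict,
        "journalist_verdict": journalist_verdict,
        "note": note,
        "timestamp": time.time(),
    }
    log.append(entry)
    return entry

def false_negative_queue(transactions):
    """Surface transactions the system marked non-newsworthy so journalists
    can quickly scan them for incorrectly filtered stories."""
    return [t for t in transactions if not t.get("newsworthy")]
```

The log becomes the supervised dataset the cold-start problem (below) initially lacked; the queue is the cheap insurance against the LLM silently discarding a real story.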
Cold Start Problem: The team candidly acknowledges the chicken-and-egg problem: they needed to start the system before having human feedback data to train on. Their solution was to conduct interviews with real estate reporters, presenting them with feature lists and asking them to weight features based on how much they contribute to newsworthiness. This qualitative research informed their initial few-shot prompting approach.
The team emphasizes that answering “what makes a transaction newsworthy?” was both the central challenge and the key to success. They discovered that this question doesn’t have a universal answer.
The team made a pragmatic decision to focus on two features that emerged clearly from their reporter interviews: transaction price and involvement of famous persons. These became the primary signals for their few-shot prompting, with a rule-based backup algorithm in place to catch transactions meeting these criteria in case the LLM misses them.
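The deterministic backup described here could look like the sketch below: a hard-coded check on the two interview-derived signals that fires regardless of the LLM's verdict. The price threshold and field names are illustrative assumptions.

```python
PRICE_THRESHOLD_NOK = 30_000_000  # illustrative cutoff, not the team's value

def rule_based_backup(txn, famous_names):
    """Deterministic safety net behind the LLM: flag any transaction that
    meets the two interview-derived criteria (price, famous person),
    even if the LLM missed it. Returns the list of matched reasons."""
    reasons = []
    if txn.get("price", 0) >= PRICE_THRESHOLD_NOK:
        reasons.append("high price")
    if txn.get("buyer") in famous_names or txn.get("seller") in famous_names:
        reasons.append("famous person involved")
    return reasons
```

Running this in parallel with the LLM gives a guaranteed floor on recall for the clearly newsworthy cases, which is what makes the observation in the next paragraph measurable.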
Interestingly, they observed that as more human feedback accumulated, the LLM began missing fewer of these clearly newsworthy transactions, suggesting the human-in-the-loop approach was working.
Noisy Data: When the team gained access to a richer dataset, they initially expected better results. Instead, they faced significant data cleaning challenges, and the results weren’t proportionally better. Their hypothesis is that more complex feature sets require more sophisticated prompting or fine-tuning to avoid confusing the LLM.
Name Matching Complexity: The famous person detection feature faces multiple challenges, including resolving stage names to the legal names that appear in transaction records and disambiguating common names, which the team addresses via Wikidata lookups and, in Norway, tax registry dates of birth.
Context-Dependent Rules: The example of ground contamination is instructive for LLMOps practitioners. The LLM initially flagged contaminated properties as newsworthy—a reasonable assumption in many contexts—but this was actually common in the Oslo dataset and not particularly newsworthy. This illustrates why LLMs need domain-specific direction rather than relying on general world knowledge.
The system delivers alerts via Slackbot integration, notifying journalists when new newsworthy transactions are detected. Users can then access a dashboard to review transactions, see the LLM’s description of why each transaction was flagged as newsworthy, and filter by criteria like celebrity involvement, price per square meter, or total transaction price.
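A Slackbot delivery of this kind is commonly built on Slack's incoming webhooks; the sketch below assumes that mechanism (the team's actual integration is not specified) and uses invented field names and message wording.

```python
import json
from urllib import request

def build_alert(txn, reason):
    """Build a Slack incoming-webhook payload for a flagged transaction,
    including the LLM's explanation of why it was deemed newsworthy."""
    return {
        "text": (
            f":rotating_light: Newsworthy sale: {txn['address']} "
            f"for {txn['price']:,} NOK. Why: {reason}"
        )
    }

def post_alert(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (network call)."""
    req = request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```

The dashboard then handles the slower review workflow (filtering by celebrity involvement, price per square meter, or total price), while the Slack message serves as the real-time nudge.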
The team is transparent that the system is still in prototype stage—described candidly as “tape and glue”—and not yet fully operational in newsrooms. Several future improvements are planned.
The team notes significant variation in data accessibility by geography. Norway has open data that the team purchases as daily transaction lists from a company called Ambita. In the US, data is available but typically at the state level and more expensive. In Canada, data access is particularly difficult. The team hopes that demonstrating journalistic impact with this tool might help advocate for greater data openness.
This case study represents an honest, early-stage project rather than a polished production deployment. The team’s transparency about challenges, the “tape and glue” state of the system, and the need for continued improvement is refreshing. The approach of combining statistical anomaly detection with LLM-based reasoning and human feedback loops represents a sensible architecture for this type of application. However, the project has not yet demonstrated production-scale results, and many of the claimed benefits remain aspirational rather than proven. The example of catching a cross-country skiing star’s property sale “with test data” (meaning they would have caught it, but didn’t actually have the system running at the time) illustrates the gap between prototype promise and operational reality.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.
OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.