Outer Web | Explore

Explore >> Select a destination

You are here		richb.rice.edu Self-Consuming Generative Models Go MAD \|
\|	\|	blog.adnansiddiqi.me The Compliance Risks of Synthetic Data Generation \| Adnan's Random bytes	16.3 parsecs away Travel
\|	\|	What Is Synthetic Data? Synthetic data is machine-generated data based on real-world data. It requires building a machine learning (ML) model to capture the patterns in the original, real data before generating new synthetic data based on these patterns. The generated data accurately represents the original data's statistical distributions, patterns, and properties. Synthetic data is	16.3 parsecs away Travel
\|	\|	ssc.io Directions Towards Efficient and Automated Data Wrangling with Large Language Models \| Sebastian Schelter	16.3 parsecs away Travel
\|	\|	Data integration and cleaning have long been a key focus of the data management community. Recent research indicates the potential of large language models (LLMs) for such tasks. However, scaling and automating data wrangling with LLMs for real-world use cases poses additional challenges. Manual prompt engineering, for example, is expensive and hard to operationalise, while full fine-tuning of LLMs incurs high compute and storage costs. Following up on previous work, we evaluate parameter-efficient fine-tuning (PEFT) methods for efficiently automating data wrangling with LLMs. We conduct a study of four popular PEFT methods on differently sized LLMs for ten benchmark tasks, where we find that PEFT methods achieve performance on par with full fine-tuning, and that we can leverage small LLMs with negligible performance loss. However, even though such PEFT methods are parameter-efficient, they still incur high compute costs at training time and require labeled training data. We explore a zero-shot setting to further reduce deployment costs, and propose our vision for ZeroMatch, a novel approach to zero-shot entity matching. It is based on maintaining a large number of pretrained LLM variants from different domains and intelligently selecting an appropriate variant at inference time.	16.3 parsecs away Travel
\|	\|	blog.fastforwardlabs.com Seeing is not necessarily believing	17.3 parsecs away Travel
\|	\|	Advancements in machine learning have evolved to such an extent that machines can not only understand the input data but have also learned to create it. Generative models are one of the most promising approaches towards this goal. To train such a model we first collect a large amount of data (be it images, text, etc.) and then train a model to generate data like it. Generative Adversarial Networks (GANs) are one such class of generative models that, given a training dataset, learn to generate new data with the same statistics as the training set.	17.3 parsecs away Travel
\|	\|	bike-lab.org Displacement and density - Bike Lab	40.7 parsecs away Travel
\|		After doing some crunching this week on data about rapidly-gentrifying Valencia Street in San Francisco, and finding that residential density is actually dropping in the neighborhood despite new housing construction, I wondered whether the same phenomenon could be found elsewhere in the country. I didn't have to wait long for another case study, as Lynda Lopez and a number of other peeps I follow from Chicago posted about Mayor Lori Lightfoot's ill-considered statement about "vibrancy" in Pilsen, a gentr...	40.7 parsecs away Travel