Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.
Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.
We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover visual question answering (VQA) and specialties like neurology, pediatrics, and psychiatry.
If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.
Unlike in training general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect or outdated biomedical information may cause clinical errors, misfolded proteins, or drugs with off-target effects.
Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.
empiko 13 hours ago [-]
This is true for every subfield I have been working on for the past 10 years. The dirty secret of ML research is that Sturgeon's law applies to datasets as well - 90% of the data out there is crap. I have seen NLP datasets with hundreds of citations that were obviously worthless as soon as you put the "effort" in and actually looked at the samples.
panabee 13 hours ago [-]
100% agreed. I also advise you not to read many cancer papers, particularly ones investigating viruses and cancer. You would be horrified.
(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)
briandear 9 hours ago [-]
> this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth
Are scientists not writing those papers? There may be bad incentives, but scientists are responding to those incentives.
eszed 8 hours ago [-]
That is axiomatically true, but both harsh and useless, given that (as I understand from HN articles and comments) the choice is "play the publishing game as it is" vs "don't be a scientist anymore".
pyuser583 7 hours ago [-]
I agree, but there is an important side-effect of this statement: it's possible to criticize science, without criticizing scientists. Or at least without criticizing rank and file scientists.
There are many political issues where activists claim "the science has spoken." When critics respond by saying, "the science system is broken and is spitting out garbage", we have to take those claims very seriously.
That doesn't mean the science is wrong. Even though the climate science system is far from perfect, climate change is real and human made.
On the other hand, some of the science on gender medicine is not as established as medical associations would have us believe (though this might change in a few years). But that doesn't stop reputable science groups from making false claims.
roughly 5 hours ago [-]
If we’re not going to hold any other sector of the economy personally responsible for responding to incentives, I don’t know why we’d start with scientists. We’ve excused folks working for Palantir around here - is it that the scientists aren’t getting paid enough for selling out, or are we just throwing rocks in glass houses now?
panabee 8 hours ago [-]
Valid critique, but one addressing a problem above the ML layer, at the human layer. :)
That said, your comment has an implication: in which fields can we trust data if incentives are poor?
For instance, many Alzheimer's papers were undermined after journalists unmasked foundational research as academic fraud. Which conclusions are reliable and which are questionable? Who should decide? Can we design model architectures and training to grapple with this messy reality?
These are hard questions.
ML/AI should help shield future generations of scientists from poor incentives by maximizing experimental transparency and reproducibility.
Apt quote from Supreme Court Justice Louis Brandeis: "Sunlight is the best disinfectant."
jacobr1 7 hours ago [-]
Not an answer, but a contributory idea: meta-analysis. There are plenty of strong meta-analyses out there, and one of the things they tend to end up doing is weighing the methodological rigour of the papers along with their overlap with the combined question being analyzed. Could we use this weighting explicitly in the training process?
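One concrete way to read "use this weighting explicitly" is as a per-example weight on the training loss, with weights derived from a methodological-rigour score. A minimal sketch, assuming hypothetical rigour scores in [0, 1] supplied alongside each example (PyTorch used for illustration; nothing here comes from an existing pipeline):

```python
import torch
import torch.nn.functional as F

def rigour_weighted_loss(logits, targets, rigour_scores, floor=0.1):
    """Cross-entropy weighted by a per-example methodological-rigour score.

    rigour_scores: tensor of values in [0, 1], e.g. derived from how a
    meta-analysis graded each source paper (hypothetical scoring scheme).
    floor keeps low-rigour examples from being ignored entirely.
    """
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = floor + (1.0 - floor) * rigour_scores
    return (weights * per_example).sum() / weights.sum()

# Toy usage: 4 examples, 3 classes, made-up rigour scores.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 1])
rigour = torch.tensor([0.9, 0.2, 0.6, 1.0])
print(float(rigour_weighted_loss(logits, targets, rigour)))
```

The weighting only pushes the problem up a level, though: someone still has to decide how rigour is scored.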
panabee 5 hours ago [-]
Thanks. This is helpful. Looking forward to more of your thoughts.
Some nuance:
What happens when the methods are outdated/biased? We highlight a potential case in breast cancer in one of our papers.
Worse, who decides?
To reiterate, this isn’t to discourage the idea. The idea is good and should be considered, but doesn’t escape (yet) the core issue of when something becomes a “fact.”
edwardbernays 8 hours ago [-]
Scientists are responding to the incentives of a) wanting to do science, b) for the public benefit. There was one game in town to do this: the American public grant scheme.
This game is being undermined and destroyed by infamous anti-vaxxer, non-medical expert, non-public-policy expert RFK Jr.[1] The disastrous cuts to the NIH's public grant scheme are likely to amount to $8,200,000,000 ($8.2 billion USD) in terms of years of life lost.[2]
So, should scientists not write those papers? Should they not do science for public benefit? These are the only ways to not respond to the structure of the American public grant scheme. It seems to me that, if we want better outcomes, then we should make incremental improvements to the institutions surrounding the public grant scheme. That seems far more sensible than installing Bobby Brainworms to burn it all down.
[1] https://youtu.be/HqI_z1OcenQ?si=ZtlffV6N1NuH5PYQ
[2] https://jamanetwork.com/journals/jama-health-forum/fullartic...
> This is true for every subfield I have been working on for the past 10 years
Hasn’t data labelling being the bulk of the work been true for every research endeavour since forever?
PaulHoule 8 hours ago [-]
If you download datasets for classification from Kaggle or CIFAR, or search-ranking data from TREC, it is the same. Typically 1-2% of judgements in that kind of dataset are just wrong, so if you are aiming for the last few points of AUC you have to confront that.
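A quick way to see why: score a dataset with a ranker that is effectively perfect, then flip ~2% of the evaluation labels. The measured AUC tops out around 0.98 even though the model made no mistakes (a rough simulation; the exact ceiling depends on class balance and which labels flip):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
y_true = rng.integers(0, 2, size=n)             # ground truth
scores = y_true + rng.normal(0, 1e-6, size=n)   # a near-perfect ranker

flip = rng.random(n) < 0.02                     # ~2% bad judgements in the eval set
y_noisy = np.where(flip, 1 - y_true, y_true)

print("AUC against clean labels:", roc_auc_score(y_true, scores))   # ~1.0
print("AUC against noisy labels:", roc_auc_score(y_noisy, scores))  # ~0.98
```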
morkalork 7 hours ago [-]
I still want to jump off a bridge whenever someone thinks they can use the twitter post and movie review datasets to train sentiment models for use in completely different contexts.
panabee 13 hours ago [-]
To elaborate, errors go beyond data and reach into model design. Two simple examples:
1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA writes modified and canonical cytosine as the same letter C, even though the difference may alter gene expression -- akin to "polish" vs. "Polish".
2. Sickle-cell anemia and other diseases are linked to single-nucleotide differences. These single nucleotide polymorphisms (SNPs) mean that hard attention for DNA matters and that single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry those assumptions into healthcare. (A small sketch of both points follows below.)
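A tiny, self-contained sketch of both points (the "M" code for 5-methylcytosine and the toy sequences are illustrative conventions, not a real file format or pipeline):

```python
# 1) Plain FASTA collapses chemically different bases into one letter:
#    methylated and canonical cytosine both serialize to "C".
to_fasta = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "C"}  # "M" = 5-methyl-C (made-up code)
with_methylation = "ATGMCGTACG"
without_methylation = "ATGCCGTACG"
assert "".join(to_fasta[b] for b in with_methylation) == \
       "".join(to_fasta[b] for b in without_methylation)
# Two biologically different sequences become identical "raw" input.

# 2) A single base can carry the entire signal: the classic sickle-cell
#    variant in HBB is a one-letter GAG -> GTG codon change (Glu -> Val).
#    Any encoder that pools away single-base resolution pools away the label.
ref_codon, alt_codon = "GAG", "GTG"
diffs = [i for i, (r, a) in enumerate(zip(ref_codon, alt_codon)) if r != a]
print(f"{len(diffs)} differing base(s) out of {len(ref_codon)}: positions {diffs}")
```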
There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.
We need way more people thinking about biomedical AI.
arbot360 7 hours ago [-]
> What was true last year may be false today. For instance, ...
Good example of a medical QA dataset shifting but not a good example of a medical "fact" since it is an opinion. Another way to think about shifting medical targets over time would be things like environmental or behavioral risk factors changing.
Anyway, thank you for putting this dataset together; we certainly need more third-party benchmarks with careful annotation. I think it would be wise to segregate tasks between factual observations of data, population-scale opinions (guidelines/recommendations), and individual-scale opinions (prognosis/diagnosis). Ideally there would eventually be a formal taxonomy for this, like OMOP CDM; maybe there already is one in some dusty corner of PubMed.
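If anyone picks this up, a lightweight starting point is tagging each item with the kind of claim it makes and the date it was considered correct; the field and enum names below are just a sketch, not an existing standard (OMOP CDM models clinical data rather than QA items):

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class ClaimKind(Enum):
    FACTUAL_OBSERVATION = "factual_observation"    # e.g., a lab value or imaging finding
    POPULATION_GUIDELINE = "population_guideline"  # e.g., screening recommendations
    INDIVIDUAL_OPINION = "individual_opinion"      # e.g., prognosis/diagnosis for one case

@dataclass
class QAItem:
    question: str
    answer: str
    kind: ClaimKind
    source: str   # citing guideline body or study
    as_of: date   # when the answer was considered correct

item = QAItem(
    question="At what age should average-risk women begin biennial mammography?",
    answer="Age 40, per the April 2024 USPSTF recommendation.",
    kind=ClaimKind.POPULATION_GUIDELINE,
    source="USPSTF breast cancer screening recommendation (2024)",
    as_of=date(2024, 4, 30),
)
print(item.kind.value, item.as_of.isoformat())
```

Filtering on as_of would then let evaluators flag items whose answers predate a guideline revision, like the mammography example above.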
ethan_smith 2 hours ago [-]
Synthetic data generation techniques are increasingly being paired with expert validation to scale high-quality biomedical datasets while reducing annotation burden - especially useful for rare conditions where real-world examples are limited.
bjourne 11 hours ago [-]
What if there is significant disagreement within the medical profession itself? For example, isotretinoin is prescribed for acne in many countries, but in other countries the drug is banned or access is restricted due to adverse side effects.
jacobr1 7 hours ago [-]
Would not one approach be to just ensure the system has all the data: relevance across different health systems, side effects, and legal constraints? Then, when making a recommendation, it can account for all factors, not just prior use cases.
panabee 8 hours ago [-]
If you agree that ML starts with philosophy, not statistics, this is but one example highlighting how biomedicine helps model development, LLMs included.
Every fact is born an opinion.
This challenge exists in most, if not all, spheres of life.
K0balt 10 hours ago [-]
I think an often overlooked aspect of training data curation is the value of accurate but oblique data. Much of the “emergent capabilities” of LLMs comes from information embedded in the data, implied or inferred semantic information that is not readily obvious. Extracting this highly useful information, in contrast to specific factoids, requires a lot of off-axis images of the problem space, like a CT scan of the field of interest. The value of adjacent, oblique datasets should not be underestimated.
TZubiri 10 hours ago [-]
I noticed this when adding citations to wikipedia.
You may find a definition of what a "skyscraper" is by some hyperfocused association, but you'll get a bias towards a definite measurement like "skyscrapers are buildings between 700m and 3500m tall", which might be useful for some data mining project, but not at all what people mean by it.
The actual definition is not in a specific source but in the way the word is used across other sources, like "the Manhattan skyscraper is one of the most iconic skyscrapers". In the aggregate you learn what it is, but it isn't very citable on its own, which gives WP that pedantic bias.
Isn't labelling medical data for AI illegal as unlicensed medical practice?
Same thing with law data
mh- 8 hours ago [-]
No.
bethekidyouwant 9 hours ago [-]
Illegal?
iwontberude 8 hours ago [-]
Paralegals and medical assistants don’t need licenses
nomel 4 hours ago [-]
I think their question is a good one, and not being taken charitably.
Lets take the medical assistant example.
> Medical assistants are unlicensed, and may only perform basic administrative, clerical and technical supportive services as permitted by law.
If they're labelling data as "tumor" or "not tumor", with any agency in the process, does that fit within their unlicensed scope? Or would that labelling be closer to a diagnosis?
What if the AI is eventually used to diagnose, based on data that was labeled by someone unlicensed? Should there need to be a "chain of trust" of some sort?
I think the answer to liability will be all on the doctor agreeing/disagreeing with the AI...for now.
SkyBelow 4 hours ago [-]
To answer this, I would think we should consider other cases where someone could practice medicine without legally doing so. For example, could they tutor a student and help them? Go through unknown cases and make judgements, explaining their reasoning? As long as they don't oversell their experience in a way that might be considered fraud, I don't think this would be practicing medicine.
It does open something of a loophole. Oh, I wasn't diagnosing a friend, I was helping him label a case just like his as an educational experience. My completely IANAL guess would be that judges would look on it based on how the person is doing it, primarily if they are receiving any compensation or running it like a business.
But wait... the example the OP was talking about is doing it like a business and likely doesn't have any disclaimers properly sent to the AI, so maybe that doesn't help us decide.
vidarh 13 hours ago [-]
I've done review and annotation work for two providers in this space, and so regularly get approached by providers looking for specialists with MSc's or PhD's...
"High-paid" is an exaggeration for many of these, but certainly a small subset of people will make decent money on it.
At one provider I was as an exception paid 6x their going rate because they struggled to get people skilled enough at the high-end to accept their regular rate, mostly to audit and review work done by others. I have no illusion I was the only one paid above their stated range. I got paid well, but even at 6x their regular rate I only got paid well because they estimated the number of tasks per hour and I was able to exceed that estimate by a considerable margin - if their estimate had matched my actual speed I'd have just barely gotten to the low end of my regular rate.
But it's clear there's a pyramid of work, and a sustained effort to create processes that allow the bulk of the work to be done by low-cost labellers and then push smaller and smaller subsets of the data up to more expensive experts, as well as tooling to cut down the amount of time experts spend, e.g. by starting with synthetic data (including model-generated reviews of model-generated responses).
I don't think I was at the top of that pyramid - the provider I did work for didn't handle many prompts that required deep specialist knowledge (though I did get to exercise my long-dormant maths and physics knowledge, which doesn't say too much). I think most of what we addressed would at most need people with MSc-level skills in STEM subjects. And so I'm sure there are a few more layers on the pyramid handling PhD-level complexity data. But from what I'm seeing from hiring managers contacting me, I get the impression the pay scale for them isn't that much higher (with the obvious caveat, given what I mentioned above, that there almost certainly are people getting paid high multiples of the stated scale).
Some of these pipelines of work are highly complex, often including multiple stages of reviews, sometimes with multiple "competing" annotators in parallel feeding into selection and review stages.
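For readers unfamiliar with these pipelines, here's a toy sketch of the escalation shape described above; the threshold, the roles, and the stand-in annotator functions are invented for illustration, and real pipelines involve many more stages:

```python
from collections import Counter

def label_item(item, cheap_annotators, expert, min_agreement=0.8):
    """Route an item through a low-cost labeller pool and escalate to an
    expert only when the pool disagrees. Returns (label, who_decided)."""
    votes = [annotate(item) for annotate in cheap_annotators]
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label, "pool"
    return expert(item), "expert"  # the scarce, expensive resource

# Toy usage with stand-in annotator functions.
cheap = [lambda x: "benign", lambda x: "benign", lambda x: "malignant"]
expert = lambda x: "malignant"
print(label_item("case-001", cheap, expert))  # pool disagrees -> ('malignant', 'expert')
```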
ljlolel 7 hours ago [-]
[dupe]
the_brin92 8 hours ago [-]
I've been doing this for one of the major companies in the space for a few years now. It has been interesting to watch how much more complex the projects have gotten over the last few years, and how many issues the models still have. I have a humanities background which has actually served me well here as what constitutes a "better" AI model response is often so subjective.
I can answer any questions people have about the experience (within code of conduct guidelines so I don't get in trouble...)
merksittich 5 hours ago [-]
Thank you, I'll bite. If within your code of conduct:
- Are you providing reasoning traces, responses or both?
- Are you evaluating reasoning traces, responses or both?
- Has your work shifted towards multi-turn or long horizon tasks?
- If you also work with chat logs of actual users, do you think that they are properly anonymized? Or do you believe that you could de-anonymize them without major efforts?
- Do you have contact with other evaluators?
- How do you (and your colleagues) feel about the work (e.g., moral qualms because "training your replacement" or proud because furthering civilization, or it's just about the money...)?
dbmikus 3 hours ago [-]
What kinds of data are you working on? Coding? Something else?
I've been curious how much these AI models look for more niche coding language expertise, and what other knowledge frontiers they're focusing on (like law, medical, finance, etc.)
mNovak 4 hours ago [-]
Curious how one gets involved in this, and what fields they're seeking?
TheAceOfHearts 14 hours ago [-]
It would be great if some of these datasets were free and opened up for public use. Otherwise it seems like you end up duplicating a lot of busywork just for multiple companies to farm more money. Maybe some of the European initiatives related to AI will end up including the creation of more open datasets.
Then again, maybe we're still operating from a framework where the dataset is part of your moat. It seems like such a way of thinking will severely limit the sources of innovation to just a few big labs.
andy_ppp 14 hours ago [-]
Why would companies paying top dollar to refine and create high quality datasets give them away for free?
https://web.archive.org/web/20190224031626/https://blog.open...
This was published before anyone knew that running an AI company would be very, very expensive.
some_random 10 hours ago [-]
I feel like that was by far the most predictable part of running an AI company.
delfinom 11 hours ago [-]
ClosedAI gonna ClosedAI
flir 13 hours ago [-]
Same reason they give open source contributions away for free. Hardware companies attempting to commoditize their complement. I think the org best placed to get strategic advantage from releasing high quality data sets might be Nvidia.
https://huggingface.co/nvidia/datasets?sort=most_rows
There are some good datasets for free though, e.g. HLE. Although I'm not sure if they are marketing gimmicks.
azemetre 5 hours ago [-]
Because we can make the government force them to.
KaiserPro 14 hours ago [-]
> operating from a framework where the dataset is part of your moat
Very much this. Its the dataset that shapes the model, the model is a product of the dataset, rather than the other way around (mind you, synthetic datasets are different...)
gexla 14 hours ago [-]
Right, and they pay a lot of money for this data. I know someone who does this, and one prompt evaluation could go through multiple rounds and reviews that could end up generating $150+ in payouts, and that's just what the workers receive. But that's not quite what the article is talking about. Each of these companies does things a bit differently.
ripped_britches 8 hours ago [-]
Don’t worry - the labs will train on this expert data and then everyone will just distill their models. Or, by now, the model itself can be an expert annotator.
illegalmemory 10 hours ago [-]
This could work with a Wikipedia-like model. It's very difficult to pull off, but a next-generation Wikipedia would look like this.
yorwba 2 hours ago [-]
I think it would be difficult to make that work, because Wikipedia has a direct way of converting users into contributors: you see something wrong, you edit the article, it's not wrong anymore.
Whereas if you do the same with machine learning training data, the influence is much more indirect and you may have to add a lot of data to fix one particular case, which is not very motivating.
NitpickLawyer 12 hours ago [-]
> Maybe some of the European initiatives related to AI will end up including the creation of more open datasets.
The EU has started the process of opening discussions aiming to set the stage for opportunities to arise on facilitating talks looking forward to identify key strategies of initiating cooperation between member states that will enable vast and encompassing meetings generating avenues of reaching top level multi-lateral accords on passing legislation covering the process of processing processes while preparing for the moment when such processes will become processable in the process of processing such processes.
#justeuthings :)
joshdavham 9 hours ago [-]
I was literally just reached out to this morning about a contract job for one of these “high quality datasets”. They specifically wanted python programmers who’ve contributed to popular repos (I maintain one repository with approx. 300 stars).
The rate they offered was between $50-90 per hour, so significantly higher than what I’d think low-cost data labellers are getting.
Needless to say, I marked them as spam though. Harvesting emails through GitHub is dirty imo. Was also sad that the recruiter was acting on behalf of a yc company.
apical_dendrite 8 hours ago [-]
The latest offer I saw was $150-$210 an hour for 20hrs/week. I didn't pursue it so I don't know if that's what people actually make, but it's an interesting data point.
antonvs 6 hours ago [-]
What kind of work was involved for that one? How specialist, I mean?
TrackerFF 8 hours ago [-]
I don’t know if it is related, but I’ve noticed an uptick in cold calls / approaches for consulting gigs related to data labeling and data QA, in my field (work as an analyst). I never got requests like that 2++ years ago.
htrp 4 hours ago [-]
Starting a data labeling company is the least AI way to get into AI.
glitchc 3 hours ago [-]
Some people sell shovels, others the grunts to use them.
The "bitter lesson" is from a paradigm where high quality data seemed infinite, and so we just need more and more compute to gobble up the data.
jsnider3 4 hours ago [-]
Not really.
rnxrx 5 hours ago [-]
It's only a matter of time until private enterprises figure out they can monetize a lot of otherwise useless datasets by tagging them and selling them (likely via a broker) to organizations building models.
The implications for valuation of 'legacy' businesses are potentially significant.
htrp 4 hours ago [-]
Already happening.
some_random 4 hours ago [-]
Bad data has been such a huge problem in the industry for ages; honestly, a huge portion of the worst bias (racism, sexism, etc.) stems directly from low-quality labels.
quantum_state 12 hours ago [-]
It is expert systems, evolved …
cryptokush 12 hours ago [-]
welcome to macrodata refinement
verisimi 13 hours ago [-]
This is it - this is the answer to the AI takeover.
Get an AI to autogenerate lots of crap! Reddit, HN comments, false datasets, anything!
Cthulhu_ 8 hours ago [-]
That's just spam / more dead internet theory, and there are (or will be) companies that curate datasets and filter out generated stuff/spam or hand-pick high-quality data.
charlieyu1 12 hours ago [-]
I'll believe it when it happens. A major AI company got rid of an expert team last year because they thought it was too expensive.
techterrier 13 hours ago [-]
The latest in a long tradition, it used to be that you'd have to teach the offshore person how to do your job, so they could replace you for cheaper. Now we are just teaching the robots instead.
skeezyboy 12 hours ago [-]
[flagged]
kjkjadksj 6 hours ago [-]
Yeah I’ve avoided these job postings out of principle. I’m not going to be the one to contribute to obsoleting myself and my industry.
Melonololoti 15 hours ago [-]
Yep, it continues the gathering of more and better data.
AI is not hype. We have started to actually do something with all the data, and this process will not stop soon.
The RL that is now happening through human feedback alone (thumbs up/down) is massive.
KaiserPro 14 hours ago [-]
It was always the case. We only managed to make a decent model once we created a decent dataset.
This meant making a rich synthetic dataset first, to pre-train the model, before fine tuning on real, expensive data to get the best results.
but this was always the case.
noname120 3 hours ago [-]
RLHF wasn't needed for Deepseek, only gobbling up the whole internet — both good and bad stuff. See their paper
rtrgrd 12 hours ago [-]
I thought human preference was typically considered a noisy reward signal.
ACCount36 7 hours ago [-]
If it was just "noisy", you could compensate with scale. It's worse than that.
"Human preference" is incredibly fucking entangled, and we have no way to disentangle it and get rid of all the unwanted confounders. A lot of the recent "extreme LLM sycophancy" cases is downstream from that.
smohare 8 hours ago [-]
[dead]