|
Scooped by
Gilbert C FAURE
October 13, 2013 8:40 AM
|
This is a personal notebook. Thanks to John Dudley for the following tweet: "If you like interesting snippets on all sorts of subjects relevant to academia, information, the world, highly recommended is @grip54's collection." Content curation: the shared memory of scientific and societal monitoring.
|
Scooped by
Gilbert C FAURE
Today, 11:28 AM
|
Earth at night, illuminated by a full moon: a dark sphere that, with editing or a longer exposure, turns pale blue, showing bright spots, the cities at night, across the Iberian Peninsula, Africa's Mediterranean coastline, and South America on the right.
Dark shot: https://lnkd.in/d6gmWgiF
Colorful shot https://lnkd.in/dr_4iEEm
#earthatnight #artemisII
|
Scooped by
Gilbert C FAURE
April 3, 11:38 AM
|
Why Choosing Between Action Research and Case Study Can Make or Break Your Thesis.
Many graduate students weaken their thesis by confusing action research with case study, yet the two serve fundamentally different academic purposes.
Action research is initiated to solve an immediate problem. It focuses on implementing solutions, often within the field of education, where researchers may also act as participants in the research process. This approach is practical, intervention-based, and solution-oriented.
Case study, by contrast, involves in-depth analysis of a particular event or case over a long period of time. It emphasizes observing and analyzing a situation, is used in many fields, and does not provide a solution to a problem. Researchers typically do not take part in the research setting.
Misunderstanding this distinction leads to flawed methodology, weak research design, and inconsistent findings, common issues in rejected proposals.
If you need thesis help, WhatsApp DocAdeson on +14243487554
Find this useful? Follow + like + repost + comment.
#DrAdeson #AcademicResearch #ResearchMatters #ResearchCommunity #AcademicWriting #PhDLife #PostdocLife #GradSchool
|
Scooped by
Gilbert C FAURE
April 3, 11:29 AM
|
LLMs are OK for medical diagnoses, but public-facing AI chatbots are not. The LLMs completed the scenarios accurately, correctly identifying conditions in 94.9% of cases and dispositions in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and dispositions in fewer than 44.2%, both no better than the control group.
https://lnkd.in/gRCrBSkE
#LLM #AI
|
Scooped by
Gilbert C FAURE
April 2, 1:18 PM
|
Reddit wants to scan your eye before letting you write a comment. It sounds like science fiction. But no: it is an official announcement dated March 25, 2026.
Reddit has a problem. Millions of fake automated accounts are flooding the platform: programs that post, comment, and like in place of real users. Reddit deletes 100,000 of them per day, and that is no longer enough. Digg, its former competitor, has just shut down, overwhelmed by machines.
The solution? Ask suspicious accounts to prove that a human is behind them, using Face ID, passkeys, or even World ID, a system that scans your iris.
The detail worth knowing: World ID is a project co-founded by Sam Altman, the CEO of OpenAI. The same Sam Altman who invested more than $60 million in Reddit, sat on its board for seven years, and owns more shares than the platform's CEO. His stake is worth more than a billion dollars. A slight conflict of interest.
So here we are, caught between two risks: letting the open internet die under fake accounts, or saving it by handing our biometric data to the owner of ChatGPT.
I don't have the solution. But if the only way to prove you are human is to sacrifice your anonymity, then we have a serious design problem.
|
Scooped by
Gilbert C FAURE
April 1, 10:39 AM
|
Not all evidence is created equal, and in public health, that distinction saves lives.
Understanding clinical study designs is the foundation of evidence-based decision-making:
1. Observational studies (cohort, case-control, cross-sectional) help us detect patterns, associations, and disease burden, critical for surveillance and hypothesis generation.
2. Experimental studies (randomized vs non-randomized) go a step further, establishing causality through controlled intervention.
But here's the nuance many overlook:
✔️ Cohort studies track exposure → outcome (powerful for incidence & risk)
✔️ Case-control studies work backward (efficient for rare diseases)
✔️ Cross-sectional studies provide snapshots (essential for prevalence)
✔️ Randomized trials minimize bias and remain the gold standard, but are not always feasible in real-world public health settings
The real expertise lies not in choosing the "best" design, but in choosing the right design for the right question.
In an era of data abundance and rapid policy decisions, strengthening our understanding of study designs is not optional; it is a professional responsibility.
#Epidemiology #PublicHealth #EvidenceBasedPractice #ClinicalResearch #DataScience #GlobalHealth #ResearchMethods
|
Scooped by
Gilbert C FAURE
April 1, 6:12 AM
|
Introducing my new white paper: The myth of the academic superstar - or why name disambiguation is crucial
|
Scooped by
Gilbert C FAURE
April 1, 3:59 AM
|
What should your organization know about the FedNow Service, an instant payment infrastructure developed by the Federal Reserve?
These resources from Federal Reserve Financial Services are a great starting point. Get up to speed on the basics about these innovative payments, how they work and the benefits they offer: https://bit.ly/47zdgnJ
|
Scooped by
Gilbert C FAURE
April 1, 3:57 AM
|
More than 9,000 researchers published at least 72 papers in a single year - more than one paper every five days - in one or more of the years between 2019 and 2024.
When a "more conservative threshold" of 40 papers a year was applied, so-called "hyper-prolific authors" increased in number by 66 per cent, from 2,517 in 2019 to 4,189 in 2023, a 2025 study found, against a wider increase in publications of 15 per cent over that period.
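The arithmetic behind the two claims above is easy to verify; a quick sanity check of the quoted figures:

```python
# 72 papers in a year -> roughly one paper every five days
papers_per_year = 72
days_per_paper = 365 / papers_per_year
print(round(days_per_paper, 1))  # 5.1

# 2,517 -> 4,189 hyper-prolific authors is the reported 66% increase
authors_2019, authors_2023 = 2517, 4189
pct_increase = (authors_2023 - authors_2019) / authors_2019 * 100
print(round(pct_increase))       # 66
```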
Last year, Clarivate excluded 432 authors from its latest Highly Cited Researchers list in response to concerns over "extreme levels of publication relative to field baselines".
There is also a "growing trend of multiple institutional affiliations", often across different countries, with "some authors listing affiliations with more than 20 institutions".
Concerns abound that authors who publish on a weekly basis are cutting corners, corrupting authorship norms and overburdening the peer review system, with AI likely to make matters worse. But if incentives are misaligned, what can be done? And is the moral panic exaggerated? Jack Grove reports for Times Higher Education. https://lnkd.in/embsRNmZ
|
Scooped by
Gilbert C FAURE
March 31, 6:26 AM
|
Javiera Atenas is a senior lecturer in the Faculty of Business, Arts, Social Sciences and Technology at the University of Suffolk, in the United Kingdom. She leads the postgraduate certificate in pedagogical practice and teaches the analysis and visualization of...
|
Scooped by
Gilbert C FAURE
March 31, 4:18 AM
|
Information captured but not shared is information lost. It is one of the most common blind spots I encounter in organizations: teams that monitor, signals that surface, and decision-makers who see only a tiny fraction of them.
Monitoring without a distribution circuit is not monitoring. It is collecting.
Three questions to test your organization:
➡️ Who receives your monitoring briefs today?
➡️ In what form? How often?
➡️ Does it change anything in the decisions being made?
If you have no clear answer to these three questions, that is the starting point. A knowledge fresco and a collaborative workshop to move clearly into action.
That is exactly what we will be working on May 20 with those who monitor, and May 28 with those who decide.
Registration link to follow in response to your comments or DMs.
#décision #veilleur #veille #compétences #fresque #collaboratif #pragmatique
|
Scooped by
Gilbert C FAURE
March 30, 8:29 AM
|
NotebookLM strikes again! Infographics with custom styles are now available: 10 predefined styles (editorial, clay, kawaii...) plus the ability to create your own via a simple prompt. Your documents turned into striking visuals in one click.
|
Scooped by
Gilbert C FAURE
March 29, 4:06 AM
|
Which AI collects the most personal data?
According to Surfshark's study, Meta AI is the AI that collects the most personal data.
It covers 33 of 35 data types, almost everything it is possible to collect.
It is also one of the few to include sensitive data, such as financial information or certain personal data.
Behind it, other tools such as Gemini also collect sensitive data. ChatGPT has broadened its collection in recent months but remains below them.
Claude is among the most restrained.
Takeaway: your conversations are not trivial. They can be stored, analyzed, and sometimes used for other purposes.
A good reflex: avoid sharing sensitive information there.
|
Scooped by
Gilbert C FAURE
Today, 11:33 AM
|
Most people think a drug works simply because of what is inside it. In reality, how it enters your body can be just as important.
This is called the route of drug administrationâoral, intravenous, intramuscular, subcutaneous, inhalational, topicalâand each route changes how a drug behaves inside you.
Why does this matter for the general population?
Because the same drug can act very differently depending on how it is taken:
• A tablet may take 30–60 minutes to act, while an injection can work within seconds
• Some drugs are destroyed in the stomach and must never be taken orally
• Incorrect use (like crushing sustained-release tablets) can lead to toxicity
• Inhalers, if used incorrectly, may deliver almost no benefit despite regular use
In simple terms: right drug + wrong route = wrong outcome
As an MD trainee in Medical Pharmacology, this is not just academic knowledge for me; it is a responsibility.
We often see:
• Antibiotic misuse due to wrong administration practices
• Poor control of chronic diseases because of improper drug use
• Adverse drug reactions that could have been prevented with basic awareness
Educating people about how to take medicines correctly can:
• Improve treatment outcomes
• Reduce side effects
• Prevent drug resistance
• Empower patients to participate in their own care
Pharmacology is not just about drugs; it is about optimizing how those drugs interact with human biology.
And sometimes, the smallest detail, like the route of administration, makes the biggest difference.
#Medicine #Pharmacology #PatientEducation #RationalUseOfMedicines #Healthcare #MedicalEducation
|
Scooped by
Gilbert C FAURE
Today, 11:25 AM
|
Claude 4.5 has become a real work suite.
And if you are an executive, here is the essential to remember:
1. Choose the right model
Opus 4.5 → strategy, complex reasoning, high-stakes topics
Sonnet 4.5 → drafting, synthesis, everyday business use
Haiku 4.5 → simple tasks, speed, high volumes
Do not ask the same thing of every model.
It is like using the same vehicle to deliver a package... or to cross a desert.
2. Claude is not just for writing
It can also:
→ search the web
→ analyze files
→ work with Google Drive
→ manage Projects
→ code with Claude Code
→ produce deliverables with Artifacts
The topic is no longer "write me a text".
The topic is: move my work forward.
3. The most useful features for an executive
Artifacts → produce an action plan, an SOP, a proposal, a table
Web Search → monitoring, meeting preparation, competitive analysis
File analysis → understand your sales, leads, quotes, margins
Projects → create a working memory per topic: marketing, sales, recruitment, management
Claude Code → prototype an internal tool, accelerate a digital project, create an MVP
File uploads → provide PDFs, contracts, meeting minutes, screenshots so it can analyze your reality
4. The best concrete use cases
→ prepare a sales meeting in 3 minutes
→ turn a meeting into decisions + tasks + priorities
→ analyze a sales or leads file
→ create a business proposal faster
→ structure a 90-day plan
→ turn scattered documents into a clear system
5. The real method
The beginner says:
"Summarize this for me."
The advanced user says:
→ here is my objective
→ here are my files
→ here are my constraints
→ ask me the missing questions
→ propose a plan
→ execute
→ improve the V1
In short: they do not "prompt".
They manage Claude like a collaborator.
That is the real upgrade.
---------------------------------
PS: I am hosting a masterclass on April 10 at 8 PM to help you make your Claude upgrade and learn to use it to free yourself from operations, reduce your mental overload, and accelerate your company's growth.
Registration here:
https://lnkd.in/eN6mBd9B
|
Scooped by
Gilbert C FAURE
April 3, 11:33 AM
|
PhD Students - How to check if your research idea is actually new?
First, let's understand why novelty is important for research
Here is what reviewers will look for in your research
1. Novelty → Is it new?
2. Significance → Is it important for anyone?
3. Methodology → Is it conducted the right way?
4. Verification → Can other researchers verify it?
5. Presentation → Is it presented in the right way?
You see, novelty comes at the top of this list.
To confirm novelty, meet PatSnap Eureka.
Eureka thinks like an IP expert.
Here is how it works:
1. Go to https://lnkd.in/dqiq55cM
2. Describe your research idea in 20-30 words
3. Eureka scans 200M+ patents to compare your idea
4. It shows you a side-by-side table of your idea vs existing ones
5. Export the entire novelty report to share with others
Why should you try it?
✅ Confirms the novelty of your research idea
✅ Gives you confidence in your research direction
✅ Lets you change your research idea if it is not novel
✅ After confirmation, dive deep into your research
Try Eureka for FREE: https://lnkd.in/dqiq55cM
Anything you'd like to add?
#phd #research
|
Scooped by
Gilbert C FAURE
April 2, 1:21 PM
|
AI has settled into everyday medical tools. Not by collective decision, but by gradual drift.
Here is the Top 10 of its uses in healthcare.
We wanted to take stock. Not of the promises, but of what actually exists: what is deployed, what is still in progress, and what remains to be built.
This carousel lists the 10 most documented uses of AI for caregivers in 2025-2026. AI did not wait to be invited. It slipped into prescription software, electronic patient records, documentation tools, and alert systems. Bit by bit, through successive integrations, and without caregiver training anticipating this shift.
What the overview shows is a two-speed reality.
On one side, tools that deliver on their promises:
▪️ AI scribing halves documentation time
▪️ Literature monitoring that used to take an hour now takes 15 minutes, provided you know how to do it well with AI
▪️ Clinical decision support, when coupled with medical judgment, produces better diagnoses than either alone
The paradox is stark. The tools work. The gains are measurable. And yet 45% of caregivers never inform their patients that they use AI in their care, while 84% of French people would like to be informed.
This is not ill will; it is the absence of a framework. We automate tasks without training people in what that entails, then we deploy tools without preparing caregivers to evaluate what they delegate, to identify where the tool fails, and to keep their clinical judgment sovereign in the face of an algorithmic recommendation.
Training is not an accessory to deployment. It is its condition of validity.
That is exactly what Elliacare addresses. Not enthusiasm for the tools, but the competence to use them, and to know when not to trust them.
Swipe the carousel to see the 10 uses, their maturity levels, and the figures that support them.
Of these 10 uses, which do you already practice, and which are you still missing?
#IAenSanté #Elliacare #MédecineAugmentée #FormationIA
|
Scooped by
Gilbert C FAURE
April 2, 1:13 PM
|
The number of women working as scientists and engineers in the EU reached 7.9 million in 2024, representing 40.5% of the scientists and engineers' workforce across all economic activities.
Across EU regions, highest shares in:
Canarias (58.8%), Região Autónoma dos Açores (57.3%), Madeira (56.4%)
Lowest in:
Közép-Magyarország (30.0%), Manner-Suomi (30.7%), Sud (31.1%)
Please note that the map includes available regional data from EU countries, EFTA and candidate countries. The ranking in the caption of the post is based on data from EU countries only.
Learn more: https://lnkd.in/eHQqWP_g
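The headline share implies a total workforce size; a quick back-of-the-envelope check from the figures above:

```python
# 7.9 million women = 40.5% of all scientists and engineers in the EU (2024)
women = 7.9e6
share = 0.405
total = women / share
men = total - women
print(round(total / 1e6, 1))  # 19.5  (implied total, in millions)
print(round(men / 1e6, 1))    # 11.6  (implied men, in millions)
```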
|
Scooped by
Gilbert C FAURE
April 1, 4:46 AM
|
At Prisma Media, 40% of articles are generated by AI. Journalists are digitally cloned to produce videos. Nobody told you.
Le Monde has signed with Meta and OpenAI to integrate its content into AI assistants. TF1 is experimenting with assisted production. AFP supplies the wire that AI agents summarize and redistribute. In the United States, 9% of articles are already partially written by AI, without mentioning it.
3,434 journalist positions were cut in 2025. In 2026, the pace is accelerating. Washington Post, Politico, Wall Street Journal: all hit.
What is dying in journalism:
- The routine article. Sports results, stock prices, weather: an AI agent writes those in 10 seconds. The Associated Press produces 730,000 automated articles per year. The journalist covering facts is up against a machine that never sleeps.
- The outlet as sole intermediary. When readers put a question to an AI assistant drawing on Le Monde, AFP, and 200 other sources, they no longer need to visit the newspaper's site. Traffic drops. Ad revenue drops. The business model crumbles.
- Default trust. 9% of articles partially AI-written without disclosure. Cloned journalists at Prisma. And only 12% of readers are comfortable with 100% AI content. The day readers no longer know who is writing, they disengage.
The survival test for Le Monde, TF1, Prisma Media, and AFP:
1. Bet on investigation, not on the feed. The factual article is dying. Investigation, analysis, explanation: that is what AI cannot do. The journalist who survives is the one who goes into the field, not the one who rewords a wire dispatch.
2. Become the trusted source for AI agents. Le Monde has understood: if AI assistants cite your articles, you become the infrastructure of truth. The outlet that refuses to feed the LLMs disappears from the answers. The one that negotiates its place survives.
3. Embrace full transparency about AI use. Readers forgive AI. They do not forgive lying. Prisma Media produces 40% AI content? Fine. But say so. The outlet that plays it transparent wins trust.
My conviction: journalism will not die. But the journalist who produces content that AI does better, yes. What will remain are the investigators, the analysts, the editorialists. Those who think.
Series "The Fall of the Giants", Season 3 [13/15]. Yesterday: Clifford Chance, Gide, Bredin Prat. Tomorrow: HEC, ESSEC, INSEAD.
Executives: is your monitoring human or automated? https://lnkd.in/e6k46944
Consultants: AI communication is a new playing field. https://lnkd.in/eaJd3bZ8
Free masterclass on April 16: go from prompting to agentic AI in one hour. https://lnkd.in/eZGGrvvY
Our bootcamps run on Claude. An AI project? https://decisionia.com/rdv
You are reading AI-written articles without knowing it. Does that bother you?
|
Scooped by
Gilbert C FAURE
April 1, 3:59 AM
|
Are boys in "crisis", and is the manosphere playing a part? My new feature for Nature Magazine looks at data on boys and young men, including education, health and attitudes. And it asks whether talk of a male crisis risks fueling hostility towards, or sidelining, women and girls. https://lnkd.in/eJGhkXuA
The data and interviews suggest that:
- Globally, more boys than girls are out of school; young men are less likely to attend higher education.
- Injuries, from road accidents, violence and self-harm, are strikingly higher for male adolescents. More boys than girls die by suicide.
- Mental health disorders are a large and growing problem for boys and girls.
- Stereotypical ideas of masculinity are common, e.g. that men must be tough, self-sufficient, financial providers and in control in relationships. In one survey, 63% of young men said they regularly engaged with a masculinity or men influencer. But research on the manosphere and its impact is still limited.
It's uncomfortable and controversial to talk about "boys in crisis" in the face of entrenched and worsening discrimination against girls and women. Many things are worse for adolescent girls.
The message I heard was: it's important to understand the challenges that all young people are facing. Many thanks to the researchers and experts who spoke to me about this important topic, one that I was particularly interested to report on as the mum of three boys.
|
Scooped by
Gilbert C FAURE
April 1, 3:56 AM
|
A few days ago, over lunch with some senior academics, I heard a really meaningful saying:
"Science doesn't come from being constantly busy; it emerges from having periods of idleness."
That instantly reminded me of this powerful insight on creativity in science from Max Perutz, a Nobel laureate: "Creativity in science, as in art, cannot be organized. It arises spontaneously from individual talent. Well-run laboratories can foster it, but hierarchical organizations, inflexible bureaucratic rules, and mountains of futile paperwork can kill it. Discoveries cannot be planned; they pop up, like Puck, in unexpected corners."
In a world full of endless meetings, grant deadlines, metrics, and "productivity" pressure, these two thoughts hit hard.
Real breakthroughs often come not from grinding harder, but from protecting unstructured time: time to think, wander, connect dots that no schedule could predict.
|
Scooped by
Gilbert C FAURE
March 31, 4:24 AM
|
Most people are still using AI like a chatbot.
That's the biggest mistake in 2026.
Because AI is evolving into something much bigger...
Agentic AI: systems that don't just respond but think, plan, and execute tasks on their own.
If you understand this, you're already ahead of 99%.
Here's the simple breakdown of how modern AI works:
• AI & ML Foundations
→ NLP, Deep Learning, Transformers
→ The core that powers everything
• Gen AI Layer
→ Text, Image, Audio, Video generation
→ Prompt Engineering + RAG
• AI Agents
→ Tool usage & automation
→ Memory + decision-making
→ Multi-step task execution
• Agentic AI (Next Level)
→ Autonomous systems
→ Goal-based execution
→ Self-improving workflows
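The "AI Agents" layer described above (tool usage, memory, multi-step execution) can be sketched as a plain loop. Everything here is illustrative: a toy rule stands in for the LLM planner, and `calculator` is a made-up tool.

```python
def calculator(expr: str) -> str:
    # Toy tool: evaluate a plain arithmetic expression.
    # (Never eval untrusted input in real code.)
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_agent(goal: str, max_steps: int = 3):
    memory = []  # the agent's memory: (tool, input, result) triples
    for _ in range(max_steps):
        # Planner stand-in: a real agent would ask an LLM which tool to call next.
        if any(ch.isdigit() for ch in goal):
            result = TOOLS["calculator"](goal)
            memory.append(("calculator", goal, result))
            return memory  # goal reached, stop the loop
    return memory

print(run_agent("2 + 3 * 4"))  # [('calculator', '2 + 3 * 4', '14')]
```

The jump to "agentic AI" is mainly that the planner itself sets sub-goals and revises its own workflow instead of following a fixed rule.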
---
In simple words: we are moving from asking AI → to assigning work to AI
If you learn this now: you won't just use AI, you'll build systems that work for you 24/7
Save this before it disappears
Repost to help others learn AI
Tag someone who needs to see this
What's your take on Agentic AI?
Follow Harish Kumar for more AI insights
#AI #ArtificialIntelligence #GenAI #AIAgents #FutureOfWork #Automation #TechTrends #AI2026
|
Scooped by
Gilbert C FAURE
March 31, 4:13 AM
|
Some learning sticks because it's clear. Some sticks because it's repeated.
But some stays with us simply because we built it ourselves.
Our latest blog explores the IKEA Effect and why effort increases ownership. When people contribute, solve, or create, learning stops feeling like something delivered to them and starts feeling like something they own.
That shift matters.
Because effort changes the relationship people have with what they learn. It deepens engagement. It strengthens memory. And most importantly, it makes people far more likely to use it.
In learning design, the goal isn't just to make things easy. It's to make space for contribution.
When learners build, decide, and shape outcomes, even in small ways, the experience becomes personal. And personal learning is the kind that lasts.
Write to elearning@learnnovators.com to craft learning that transforms behaviour.
#LearningDesign #LearningScience #WorkplaceLearning #InstructionalDesign
https://lnkd.in/eMnCi9Nx
|
Scooped by
Gilbert C FAURE
March 30, 3:42 AM
|
PARHAF, a human-authored corpus of clinical reports for fictitious patients in French
Xavier Tannier (Sorbonne Université, Université Sorbonne Paris Nord, Inserm, Limics, F-75006 Paris, France)
Salam Abbara (Université Paris-Saclay, UVSQ, Assistance Publique-Hôpitaux de Paris, Raymond Poincaré University Hospital, Infectious Disease Department, Garches, France; Yonsei University College of Medicine, Gangnam Severance Hospital, Department of Laboratory Medicine, Seoul, South Korea)
Rémi Flicoteaux (Assistance Publique-Hôpitaux de Paris, Department of Medical Information, Paris, France)
Youness Khalil (Health Data Hub, 75015, Paris, France)
Aurélie Névéol (Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France)
Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France)
Emmanuel Bacry (Health Data Hub, 75015, Paris, France; Université Paris-Dauphine, PSL, CNRS, CEREMADE, 75016, Paris, France)
Abstract
The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7,394 clinical reports covering 5,009 patient cases across a wide range of medical and surgical specialties.
It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
Corresponding author: Xavier Tannier, xavier.tannier@sorbonne-universite.fr
1 Background & Summary
1.1 Context and Motivation
Much of the information in electronic health records is conveyed by text such as clinical notes and discharge summaries (see, e.g., [17]). Natural language processing aims to unlock that information and make it available for downstream tasks. Publicly available clinical text corpora are a key asset to design, tune, and evaluate clinical natural language processing systems [9]. Sharing clinical text is, however, difficult: the tension between individual data privacy and corpus distributability has been widely acknowledged as the central obstacle to making clinical corpora publicly available [3]. Some resources have been made available for U.S. English clinical NLP over the past few decades [9], starting with the 2007 Computational Medicine Challenge [29] and the i2b2 series of clinical NLP shared tasks [34], many of which relied on the MIMIC database [30, 16]. However, access hurdles remain particularly salient in the French context. The European regulatory framework, among the most protective of health data, imposes severe restrictions on the circulation and secondary use of medical records. This creates a marked scarcity of open and usable corpora of French medical reports [27].
Beyond data access, models trained on clinical reports may themselves become sensitive, as they can memorize patient information during training, making the sharing of trained models legally and ethically challenging [2]. Together, these factors create a fragmented ecosystem in which institutions and research teams operate in isolation, unable to effectively pool data or models. This combination of restricted data access, model sensitivity, lack of open resources, and resulting fragmentation severely limits the development and robust evaluation of NLP systems applied to French clinical text. This creates the following challenge: Given this privacy bottleneck, how can a large, realistic, and fully privacy-preserving corpus of clinical reports be created to help clinical language processing research and development while being freely shareable? To address this challenge, NLP researchers have studied de-identification methods that remove personally identifying information from original clinical text [34, 26, 11, 5, 4] and used them to de-identify clinical datasets such as that in MIMIC [26]. However, the resulting text is pseudonymized (directly identifying information has been removed) but not anonymized (there is no guarantee that reidentification is impossible). This prevents it from being freely distributed. In the United States, the MIMIC database can be shared under a stringent data-use agreement, but it remains unclear whether this protocol is compatible with E.U. regulation. For this reason, clinical document collections in French (e.g., the MERLOT corpus [3] and other corpora extracted from French clinical data warehouses [15, 33]) were used in evaluation studies with explicit targeted ethical board approval but could not be shared due to privacy restrictions. Hahn [12] notes that, faced with the non-shareability of real patient records, NLP researchers have developed a variety of proxies for clinical text. 
One type of proxy is machine translation of English clinical datasets: Becker et al. [1] translated into German the ShARe/CLEF eHealth 2013 training dataset based on MIMIC-II data [30], Neves et al. [28] translated some clinical cases from English into French (with a focus on evaluating the performance of machine translation for measures and acronyms), and Frei et al. [7] translated into German the 2018 n2c2 shared task dataset that reused data from MIMIC-III [16]. However, translated text requires thorough human review, and cultural and health system differences make the resulting text markedly different from native clinical text. Another proxy is synthetic clinical text. The creators of GraSCCo [24] manually edited 63 de-identified German discharge summaries and case reports at multiple linguistic levels to make reidentification virtually impossible. Recent efforts have also explored the use of autoregressive generative language models to produce synthetic clinical documents in English [14], French [13], German [8], Swedish and Spanish [35]. Nonetheless, the balance between privacy and utility of the resulting material needs further analysis [21, 6]. Published case reports are a more distant proxy for clinical text, but their open-access status and availability in multiple languages have made them particularly attractive for clinical NLP. Case reports have been collected, for instance, in the following corpora: CAS [10] (French), E3C [20] (Italian, English, French, Spanish, and Basque), CANTEMIST [22] and DISTEMIST [23] (Spanish), and FRASIMED [36] (French translations of CANTEMIST and DISTEMIST). The style of case reports, however, is quite different from that of electronic health records. The closest proxies to true clinical texts are those written by health care professionals about fictitious patients, for instance, in medical textbooks or course material. The JSynCC corpus [19] extracted 400 operative reports and 470 case reports from such textbooks.
The initial copyright on the textbooks, though, prevents the free distribution of the corpus. The PARROT corpus [18] contains 2,658 radiology reports about fictitious patients, including 475 in French, written on a volunteer basis by healthcare professionals from 21 countries. This endeavor was made possible through human networking, including leveraging professional radiological societies, which may be difficult to scale up to a diversity of medical specialties.

1.2 Objectives and Contributions

To overcome this limitation, the approach adopted in the present work was to ask healthcare professionals to write new clinical reports describing fictitious patients specifically for the creation of a shareable corpus, and to distribute these reports under an open license. Because the reports are created for this purpose and do not derive from real patient data, they are anonymous and shareable by design. However, this approach raises important methodological questions: how can such reports be generated in a way that ensures both medical realism and statistical representativeness while preserving privacy? To address this challenge, we designed a corpus creation protocol that leverages clinicians' expertise while being guided by public health statistics, principles of corpus development [31, 37], and a set of predefined clinical scenarios. The protocol was implemented using a large pool of French-speaking clinicians through a partnership with associations and unions of medical residents across multiple medical and surgical specialties, which recruited 104 residents as report authors. Guidelines were developed for selecting clinical cases, using data from the French National Health Data System (SNDS [25]) as reference scenarios for report creation, and the residents authored synthetic medical reports following these guidelines. The resulting open-source French-language corpus can now be used to train and evaluate language models on targeted medical use cases.
In this article, we introduce this open-source corpus of French clinical documents. PARHAF comprises 7,394 expert-authored clinical reports describing 5,009 realistic yet fictitious patient cases. Each case is accompanied by structured documentation of the underlying clinical scenario, including the primary diagnosis, main procedure, care pathway, and discharge information when applicable. We further provide three specialized subsets specifically designed to support information extraction tasks in oncology and infectious diseases. This corpus offers a valuable resource for the development and evaluation of clinical NLP models, directly tackling the root cause of all the challenges outlined above: the inherently sensitive nature of clinical data. We release 6,185 documents corresponding to 4,254 fictitious patients under an open-source CC-BY license. The remaining portion of the corpus will be temporarily embargoed to enable future evaluations under controlled conditions, thereby limiting the risk of large language model contamination through prior exposure to the data.

1.3 Intended uses of PARHAF

This corpus is intended for research, development, and educational purposes in clinical natural language processing. It enables the sharing of clinical-style notes and annotations and supports community-wide pooling of efforts around a common, openly accessible resource. The corpus is suitable for benchmarking French medical language models, including large language models, and for conducting reproducible clinical NLP research under controlled and privacy-safe conditions. The corpus also supports uses for medical teaching, such as training medical students and residents in structured clinical report writing, diagnostic reasoning, and clinical information synthesis. It can also serve as a resource for clinical case preparation and supports training in clinical natural language processing using realistic yet fictitious reports without exposure to sensitive patient data.
The corpus further enables privacy-preserving data augmentation, either as a standalone resource or as a complement to restricted-access clinical datasets, provided its fictitious nature is explicitly acknowledged. Finally, the representativeness of part of the corpus is geared towards three use cases of the PARTAGES project, allowing methodological comparisons across these specific use cases.

1.4 Limitations and non-intended uses

Although efforts were made to create a diverse corpus that includes a variety of document types and clinical specialties, the corpus does not cover all specialties and variations of French clinical text. This corpus is intended for research purposes only, specifically for training and evaluating natural language processing models on French clinical text. It is not a substitute for clinically validated data and must not be used to support regulatory approval, clinical certification, or deployment decisions in real healthcare settings. It is not suitable for clinical use. It cannot be used for clinical decision-making, diagnosis, prognosis, treatment, or patient care. Models trained or evaluated on this data are not clinically validated, and results obtained on this corpus cannot be presented as evidence of clinical performance or safety. The corpus does not support generalization claims to real hospitals, regions, or clinical practices, nor does it allow epidemiological or population-level inference, as its distributions do not reflect real-world prevalence. It is also unsuitable for longitudinal studies or for assessing real-world clinical risk or safety, including rare adverse events or edge cases, and must not be used as a replacement for real clinical data in deployment settings. Finally, the corpus does not capture the operational constraints of real clinical environments (e.g., time pressure, workload, interruptions) and should not be used for stress-testing models under realistic clinical conditions.
2 Methods

2.1 Challenges

Building the PARHAF corpus required addressing two main challenges. The first was ensuring that recruited physicians and collected texts adequately represent the relevant dimensions of clinical language, while remaining within concrete implementation constraints (a limited author pool, a fixed corpus size, and the specific use cases targeted by the project). The second was encouraging healthcare professionals to write reports that closely resemble real clinical documents, while minimizing the risk of privacy leaks.

2.2 Clinical Scenario Design

For these reasons, we deemed it essential to provide relatively precise guidelines to assist healthcare professionals in authoring clinical reports that closely resemble real-world documents while minimizing the risk of privacy breaches. These guidelines addressed both the content and the format of the documents. Given that the recruited physicians primarily worked in hospital settings, the resulting corpus of documents focused predominantly on hospital-based clinical situations.

2.2.1 Content Development

The selection of clinical scenarios was guided by our goal of guaranteeing the representativeness of the clinical situations actually observed in French hospitals (see Section 2.3.1) and by the constraint of ensuring physicians were familiar with the clinical situation in relation to their specialty of practice or training. Both aspects were addressed using hospitalization claims data available in the French National Health Data System (SNDS [25]). Scenarios were constructed by sampling observed distributions of Diagnosis-Related Groups (DRGs), principal diagnoses (ICD-10), age, sex, type of management (e.g., ambulatory surgery), and admission and discharge modes (e.g., emergency department admission). DRGs were used (in a less formal format) to describe the type of hospitalization (e.g., surgery, medicine) and to map clinical cases to physicians' qualifications (specialties).
Secondary diagnoses were incorporated into the scenarios as a list of 10 randomly selected diagnoses from the pool of diagnoses frequently associated with the primary diagnosis–DRG pair. Patient names were assigned at random. Based on these core elements, authors were encouraged to develop the clinical case details, enriching the content with relevant and realistic information, maintaining medical consistency with the baseline information, and ensuring depth and authenticity while adhering to principles of plausibility and ethics.

2.2.2 Document Format

For document format, we aligned official recommendations with physicians' actual practices to develop specialized templates for each type of hospitalization:

• Medical hospitalization
  – Hospital discharge summary
• Surgical hospitalization
  – Pre-operative consultation report (for scheduled admissions)
  – Operative report
  – Hospital discharge summary (for ambulatory surgery, a single document was requested combining both the operative report and the discharge summary)
• Obstetrics (childbirth)
  – Pre-delivery hospitalization report (for high-risk pregnancies) or emergency department visit report (for low-risk pregnancies)
  – Delivery room report
  – Postpartum hospitalization report (maternity ward)
• Oncology
  – Pathology report

For discharge summaries, the template included: department name, reason for admission, medical history, surgical history, family history, allergies, lifestyle factors, treatment at admission, history of the present illness, clinical examination, complementary investigations, in-hospital course, discharge treatment, and conclusion. Similar minimal templates were developed for surgeries, obstetrics, and pathology reports. Authors were encouraged to follow these structures or to write in free text format, provided that all required information was included.
Finally, a structured summary section was completed at the end of each report, in which the authors specified the primary diagnosis, length of stay, and associated diagnoses mentioned in the report. The use of generative artificial intelligence tools was discouraged because it could bias both the content and the stylistic features of the reports.

2.3 Document Type and Distribution Strategy

This corpus is structured in two complementary components, targeting a total of 5,000 patients. The primary component (n = 3,900) includes patients across a wide range of medical specialties and is designed to maximize diversity and approximate representativeness, although the target size does not allow full coverage of the spectrum of possible clinical cases. The secondary component focuses on specific use cases (ICD-10 coding, oncology, and infectious diseases) and comprises patients selected outside the main distribution to support more targeted evaluation scenarios.

2.3.1 Core distribution

To approximate real-world distributions of medical activity, we relied on diagnosis frequencies derived from the SNDS [25], which provides exhaustive, nationwide hospital claims data and served as a proxy for the underlying epidemiological and care distribution across medical conditions. For the year 2024, the national claims database comprised approximately 18 million hospitalizations drawn from the SNDS. From these data, we defined clinical cases as the association of a DRG, sex, age group, and length-of-stay group. With these associations, we created a sampling database of 100,000 different clinical cases. These cases covered around 4,000 distinct ICD-10 primary diagnoses. To ensure patient privacy and data confidentiality, the sampling strategy over this distribution of clinical cases adheres to the principles of k-anonymity and l-diversity [32].
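The k-anonymity filter just described, combined with the capped square-root weighting introduced next, can be sketched as one small pipeline. This is a minimal illustration, not the project's actual code: the DRG codes, the threshold k, and the cap value are invented for the example.

```python
import math
from collections import Counter

# Toy claims records: (DRG, sex, age group, length-of-stay group).
very_common = ("06C04", "F", "40-59", "1-3d")
common = ("05M09", "M", "60-79", "4-7d")
uncommon = ("07C14", "F", "20-39", "1-3d")
rare = ("01M30", "M", "80+", ">15d")
claims = [very_common] * 120 + [common] * 40 + [uncommon] * 12 + [rare] * 3

# Step 1: k-anonymity filter -- keep only case profiles shared by at
# least k hospitalizations, so rare profiles cannot be sampled.
k = 10
counts = Counter(claims)
safe_cases = {case: n for case, n in counts.items() if n >= k}

# Step 2: square-root transform of the empirical frequency, capped at
# p_max to limit the dominance of very frequent cases, then renormalized.
p_max = 0.5  # illustrative cap; the paper's cap is far smaller
total_claims = sum(counts.values())
raw = {case: min(p_max, math.sqrt(n / total_claims))
       for case, n in safe_cases.items()}
norm = sum(raw.values())
probs = {case: w / norm for case, w in raw.items()}
```

The square root flattens the frequency distribution and the cap clips the head of it, so frequent cases remain more likely than rare ones but no longer dominate the sample.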
To preserve epidemiological realism while avoiding excessive over-representation of very frequent conditions, which would reduce clinical diversity in the corpus, we applied a square-root transformation to the empirical frequencies, yielding a preliminary sampling probability proportional to $\sqrt{f_i}$, where $f_i$ is the frequency of condition $i$ in the SNDS data. To further limit the dominance of the most common conditions, we capped this value at a maximum probability $p_{\mathrm{max}}$ (corresponding to a 0.1% sampling chance) and renormalized, giving the final sampling probability for each condition:

$$p_{i} = \frac{\min(p_{\mathrm{max}},\, \sqrt{f_{i}})}{\sum_{j} \min(p_{\mathrm{max}},\, \sqrt{f_{j}})}$$

In practice, this theoretical distribution required iterative adjustment to account for operational constraints. Because hired authors had uneven expertise across medical specialties, document production could not be distributed uniformly, and not all specialties could be covered. The final allocation therefore used the square-root-with-cap model as a guiding principle, with reallocation based on actual case availability and author capacity, while preserving broad clinical coverage. Figures S1 and S2 in the Supplementary Materials provide, respectively, a detailed breakdown of these adjustments by specialty and the number of cases written per author.

2.3.2 Specific Use Cases

In addition to the documents from the initial distribution, four specific sets of reports were assembled. Each set was designed to address a specific clinical information extraction use case:

Coding: Surgery reports from digestive surgery, orthopedic surgery, traumatology, and urology were specifically collected for a use case on ICD-10 diagnostic coding.
Identifying biomarkers in oncology: Pathology reports containing descriptions of tumor biomarkers used to inform diagnosis, prognosis, and targeted therapy selection in oncology: tissue and genomic alterations such as mutations, amplifications, and protein expression. The use case for this dataset is the automatic identification of these biomarkers.

Identifying the response to treatment in oncology: Oncology consultation reports mentioning treatment response (complete response, partial response, stable disease, progressive disease, not applicable, or indeterminate). The associated use case aims at classifying RECIST-style information from these reports.

Infectiology: Reports describing infectious episodes (including bacteremia) along with the causative bacteria and the primary site of infection.

Other use cases (pseudonymization and summarization) are planned for the PARTAGES project (described below). However, the documents dedicated to these tasks do not require a distribution that differs from that described in the previous section.

2.4 Implementation

The PARTAGES project, funded by the French government under the France 2030 initiative (operated by Bpifrance), brings together a consortium of 32 partners, including research teams, public and private healthcare institutions, and AI-focused deeptech companies. Its aim is to develop open resources to support the emergence of generative AI solutions in healthcare. The creation of the PARHAF corpus of clinical reports is one of the consortium's initiatives. The corpus development was initiated through a scoping phase involving NLP experts and physicians, aimed at balancing production volume with budgetary constraints. Writing time was estimated at 60 minutes per document: 45 minutes for the first author and 15 minutes for review, validation, and correction by a second expert.
This estimation was based on the expertise of the physicians involved and consultation with twelve residents from different specialties, active within their respective residents' associations. A maximum duration of 60 minutes per document was retained, with a gross hourly compensation of €40, corresponding to a maximum of €40 per completed and reviewed report. Recruitment was conducted through a temporary employment agency. A national outreach campaign targeted 21 residents' associations across different specialties in France. Eleven associations disseminated the call for participation, representing the following specialties: internal medicine, infectious diseases, visceral surgery, obstetrics and gynecology, neurology, pulmonology, public health, urology, oncology, anesthesiology and intensive care, and anatomical pathology. The call for participation was further circulated via the project's hospital partners and their associated medical networks, enabling the inclusion of residents from additional specialties: nephrology, hematology, orthopedics, pediatrics, gastroenterology and hepatology, and cardiology. More than 500 applications were received, reflecting strong engagement from the medical community. A final panel of 104 residents was selected, prioritizing residents in the later stages of training and ensuring broad geographic representation. This response confirmed residents' commitment to developing digital commons and supporting generative AI projects in healthcare. To ensure consistency and quality of the outputs, a structured support framework was implemented, including regular webinars, methodological guidelines, a centralized communication platform, and dedicated support. Contributors were also involved in methodological refinements through specialty-specific meetings, enabling adaptation of instructions to clinical practice and ensuring the validity and representativeness of the corpus.
From a financial perspective, operational costs primarily consisted of physician compensation, plus the management fees of the temporary employment agency (a multiplicative coefficient of 1.9 applied to the gross remuneration). The production of the 7,394 clinical reports, totaling 5,518 hours of effective work, resulted in a total operational cost of approximately €495,000.

3 Data Records

PARHAF consists of a single JSON file containing structured metadata about fictitious patients and the clinical documents associated with them. Each entry in the data array corresponds to one patient record and includes patient-level metadata, contextual information about the care scenario, and a list of associated documents. The documents themselves are not embedded in the JSON file. Instead, each document is referenced via a relative file path pointing to an external text file. These text files are stored separately and organized by medical specialty, with one directory per specialty. Each document file contains raw, unannotated plain text in French, with no markup, labels, or structural tags applied. The JSON file, therefore, acts as the index and metadata layer of the corpus, while the directory structure contains the raw textual content. The linkage between metadata and text relies exclusively on the relative paths specified in the documents[].path fields. The high-level structure of the JSON format and the path-based schema are shown in Figure 1 and Table 1, respectively. In addition to this standalone dataset, we also distribute a Hugging Face dataset (Parquet/Arrow) that is a derived representation generated automatically from the JSON files. Both formats, therefore, contain identical information and differ only in storage layout. The PARHAF corpus is openly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and the Etalab 2.0 license. It was released on March 25, 2026.
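The index-plus-text-files layout described above can be read with only the Python standard library. In this sketch, the top-level data array and the documents[].path field follow the description in the text; all other field names, the directory name, and the sample content are assumptions made for illustration, so the snippet builds its own mock mini-corpus rather than touching the real release.

```python
import json
import tempfile
from pathlib import Path

def load_corpus(json_index, corpus_root):
    """Resolve each documents[].path against the corpus root and
    attach the raw French text to its patient record."""
    data = json.loads(Path(json_index).read_text(encoding="utf-8"))
    for patient in data["data"]:
        for doc in patient["documents"]:
            doc["text"] = (Path(corpus_root) / doc["path"]).read_text(encoding="utf-8")
    return data["data"]

# Build a tiny mock corpus (one specialty directory, one document)
# to exercise the loader; names are hypothetical.
root = Path(tempfile.mkdtemp())
(root / "cardiology").mkdir()
(root / "cardiology" / "patient0001_doc1.txt").write_text(
    "Compte rendu d'hospitalisation ...", encoding="utf-8")
index = {"data": [{"documents": [{"path": "cardiology/patient0001_doc1.txt"}]}]}
(root / "index.json").write_text(json.dumps(index), encoding="utf-8")
patients = load_corpus(root / "index.json", root)
```

Because the JSON carries only relative paths, the same index works wherever the corpus directory is unpacked, which is the point of the path-based linkage.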
The primary distribution is on Hugging Face (https://huggingface.co/datasets/HealthDataHub/PARHAF).

4 Data Overview

A subset of the dataset is temporarily embargoed to enable future evaluations under controlled conditions, thereby limiting the risk of large language model contamination through prior exposure to the data. In total, we release 4,254 patients (6,185 documents) and keep 755 patients (1,209 documents) under embargo for future release. Figures 2, 3, and 4 provide details about, respectively, the patient count per medical specialty, the population pyramid chart, and the document and word counts by document type.

5 Technical Validation

All contributors involved in report writing and validation completed standardized online training covering the study objectives, authoring guidelines, and validation procedures. The full clinical scenario framework was presented in detail to ensure a consistent understanding of context and expectations. Additional specialty-specific instructions were provided where relevant, including requirements regarding the number and types of documents per patient and scenario-dependent constraints. To preserve ecological validity and avoid burdening contributors with rigid formatting that would diverge from real-world clinical practice, reports were produced as guided but free-text documents within an online text-editing environment. Human validators were responsible for content quality control, assessing clinical coherence and adherence to the instructions. This step resulted in a rejection rate of 3.6% of submitted documents. Beyond human review, an automated "sanity check" pipeline was implemented to verify compliance with key structural and procedural requirements. These checks included validating document type and the expected number of documents per patient, confirming final hospitalization duration entries, and recording required clinical procedures or acts.
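A pipeline of that kind might look like the following sketch. The allowed document types, field names, and rules here are invented for illustration; they stand in for, and do not reproduce, the project's actual checks.

```python
# Hypothetical whitelist of document types (illustrative, not the
# project's actual taxonomy).
ALLOWED_TYPES = {"discharge_summary", "operative_report",
                 "preop_consultation", "delivery_report", "pathology_report"}

def sanity_check(patient):
    """Return a list of rule violations for one patient record:
    unknown document types, wrong document count, missing final
    hospitalization duration."""
    errors = []
    docs = patient.get("documents", [])
    for doc in docs:
        if doc.get("type") not in ALLOWED_TYPES:
            errors.append(f"unknown document type: {doc.get('type')}")
    expected = patient.get("expected_doc_count")
    if expected is not None and len(docs) != expected:
        errors.append(f"expected {expected} documents, found {len(docs)}")
    if patient.get("length_of_stay") is None:
        errors.append("missing final hospitalization duration")
    return errors

ok = sanity_check({"documents": [{"type": "discharge_summary"}],
                   "expected_doc_count": 1, "length_of_stay": 3})
bad = sanity_check({"documents": [{"type": "selfie"}],
                    "expected_doc_count": 2, "length_of_stay": None})
```

Running every record through such rule functions and collecting the violations is what allows the small residue of non-conforming documents to be routed to manual correction.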
This automated stage ensured the consistency of core structured elements across the corpus. Following automated validation, 120 documents required manual correction. These interventions addressed minor typographical errors within structured fields, format-style deviations that interfered with automated parsing, or unauthorized modifications to the original instruction template. No corrections altered the underlying clinical content; all changes were limited to restoring technical conformity with the dataset specifications.

6 Data Availability

The PARHAF corpus is openly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and the Etalab 2.0 license. It was released on March 25, 2026. The primary distribution is on Hugging Face at https://huggingface.co/datasets/HealthDataHub/PARHAF.

7 Code Availability

The code used for data processing and quality control is publicly available on GitHub at https://github.com/xtannier/PAHRAF_cleaning_and_publication. This includes scripts for converting source documents from .docx to plain text, generating JSON metadata, building the Hugging Face Parquet dataset, and running the automated sanity-check pipeline described above.

References

[1] M. Becker and B. Böckmann (2016) Extraction of UMLS concepts using Apache cTAKES for German language. In Health Inform. Meets eHealth, G. Schreier, E. Ammenwerth, A. Hörbst, and D. Hayn (Eds.), Studies in Health Technology and Informatics, Vol. 223, pp. 71–76.

[2] G. Berthelier, A. Boutet, and A. Richard (2023) Toward training NLP models to take into account privacy leakages. In 2023 IEEE Int. Conf. Big Data (BigData), pp. 4854–4862.

[3] L. Campillos, L. Deléger, C. Grouin, T. Hamon, A. Ligozat, and A. Névéol (2018) A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT). Lang. Resour. Eval. 52 (2), pp.
571–601.

[4] R. Chevrier, V. Foufi, C. Gaudet-Blavignac, A. Robert, and C. Lovis (2019) Use and understanding of anonymization and de-identification in the biomedical literature: scoping review. J. Med. Internet Res. 21 (5), pp. e13484.

[5] F. Dernoncourt, J. Y. Lee, O. Uzuner, and P. Szolovits (2016) De-identification of patient notes with recurrent neural networks. J. Am. Med. Inform. Assoc.

[6] F. Estignard, S. Ghannay, J. Girard-Satabin, N. Hiebel, and A. Névéol (2025) Evaluating the confidentiality of synthetic clinical texts generated by language models. In Int. Conf. Artif. Intell. Med. (AIME), pp. 130–139.

[7] J. Frei, L. Frei-Stuber, and F. Kramer (2023) GERNERMED++: semantic annotation in German medical NLP through transfer-learning, translation and word alignment. J. Biomed. Inform. 147, pp. 104513.

[8] J. Frei and F. Kramer (2023) Annotated dataset creation through large language models for non-English medical NLP. J. Biomed. Inform. 145, pp. 104478.

[9] Y. Gao, D. Dligach, L. Christensen, S. Tesch, R. Laffin, D. Xu, T. Miller, O. Uzuner, M. M. Churpek, and M. Afshar (2022) A scoping review of publicly available language tasks in clinical natural language processing. J. Am. Med. Inform. Assoc. 29 (10), pp. 1797–1806.

[10] N. Grabar, V. Claveau, and C. Dalloux (2018) CAS: French corpus with clinical cases. In Proc. 9th Int. Workshop Health Text Mining Inf. Anal., pp. 122–128.

[11] C. Grouin, A. Rosier, O. Dameron, and P. Zweigenbaum (2009) Testing tactics to localize de-identification. In Proc. MIE 2009, K. Adlassnig, J. Mantas, and B. Blobel (Eds.), Studies in Health Technology and Informatics, Vol. 150, Amsterdam, pp. 735–739.

[12] U.
Hahn (2025) Clinical document corpora – real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data. JAMIA Open 8 (3), pp. ooaf024.

[13] N. Hiebel, O. Ferret, K. Fort, and A. Névéol (2023) Can synthetic text help clinical named entity recognition? A study of electronic health records in French. In Proc. 17th Conf. Eur. Chapter Assoc. Comput. Linguist. (EACL), pp. 2320–2338.

[14] J. Ive, N. Viani, J. Kam, L. Yin, S. Verma, S. Puntis, R. N. Cardinal, A. Roberts, R. Stewart, and S. Velupillai (2020) Generation and evaluation of artificial mental health records for natural language processing. NPJ Digit. Med. 3 (1), pp. 69.

[15] A. Jannot, E. Zapletal, P. Avillach, M. Mamzer, A. Burgun, and P. Degoulet (2017) The Georges Pompidou University hospital clinical data warehouse: a 8-years follow-up experience. Int. J. Med. Inform. 102, pp. 21–28.

[16] A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Sci. Data 3, pp. 160035.

[17] E. Laparra, A. Mascio, S. Velupillai, and T. Miller (2021) A review of recent work in transfer learning and domain adaptation for natural language processing of electronic health records. Yearb. Med. Inform. 30 (1), pp. 239–244.

[18] B. Le Guellec, K. Adambounou, L. C. Adams, T. Agripnidis, S. S. Ahn, R. Ait Chalal, T. Akinci D'Antonoli, P. Amouyel, H. Andersson, R. Bentegeac, C. Benzoni, A. A. Blandino, F. Busch, E. Can, R. Cau, A. U. Cavallo, C. Chavihot, E. Chiquete, R. Cuocolo, E. Divjak, B. Dziadkowiec-Macek, A. Elogne, S. C. Fanni, C. Ferrarotti, C. Fossataro, F. Fossataro, K. Fułek, M. Fułek, P. Gać, M. Gachowska, I. García-Juárez, M. Gatti, N. Gorelik, A.
M. Goulianou, A. Hamroun, N. Herinirina, Q. Holay, G. Ivanac, F. Kitamura, M. E. Klontzas, A. Kompanowska, R. Kompanowski, K. Kraik, D. Krupka, A. Lefèvre, T. Lemke, M. Lindholz, P. Macek, M. Makowski, L. Mannacio, A. Meddeb, L. Müller, A. Natale, B. Nguema Edzang, A. Ojeda, Y. W. Park, F. Piccione, A. Ponsiglione, M. Poręba, R. Poręba, P. Prucker, J. Pruvo, R. alba Pugliesi, F. H. Rabemanorintsoa, V. Rafailidis, K. Resler, J. Rotkegel, L. Saba, E. Siebert, A. Stanzione, A. F. Tekin, L. Toapanta-Yanchapaxi, M. Triantafyllou, E. Tsaoulia, S. Urban, E. Vassalou, F. Vernuccio, W. Wang, J. Wassélius, A. Włodarczak, S. Włodarczak, A. Wysocki, L. Xu, T. Zatoński, S. Zhang, S. Ziegelmayer, G. Kuchcinski, and K. K. Bressem (2026) PARROT, an open multilingual radiology reports dataset. Eur. J. Radiol. Artif. Intell. 5, pp. 100066.

[19] C. Lohr, S. Buechel, and U. Hahn (2018) Sharing copies of synthetic clinical corpora without physical distribution – a case study to get around IPRs and privacy constraints featuring the German JSYNCC corpus. In Proc. LREC 2018, N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, and T. Tokunaga (Eds.).

[20] B. Magnini, B. Altuna, A. Lavelli, M. Speranza, and R. Zanoli (2020) The E3C project: collection and annotation of a multilingual corpus of clinical cases. In Proc. of the Seventh Italian Conf. on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1–3, 2021, J. Monti, F. Dell'Orletta, and F. Tamburini (Eds.), CEUR Workshop Proceedings, Vol. 2769.

[21] O. Melamud and C. Shivade (2019) Towards automatic generation of shareable synthetic clinical notes using neural language models. In Proc. 2nd Clinical NLP Workshop, pp. 35–45.

[22] A. Miranda-Escalada, E. Farré, and M.
Krallinger (2020) Named entity recognition, concept normalization and clinical coding: overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In Proc. IberLEF 2020, M. Á. G. Cumbreras, J. Gonzalo, E. M. Cámara, R. Martínez-Unanue, P. Rosso, S. M. Jiménez-Zafra, J. A. O. Zambrano, A. Miranda, J. Porta-Zamorano, Y. Gutiérrez, A. Rosá, M. Montes-y-Gómez, and M. G. Vega (Eds.), CEUR Workshop Proceedings, Vol. 2664, pp. 303–323.

[23] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, and M. Krallinger (2022) Overview of DisTEMIST at BioASQ: automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources. In Proc. of the Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5–8, 2022, G. Faggioli, N. Ferro, A. Hanbury, and M. Potthast (Eds.), CEUR Workshop Proceedings, Vol. 3180, pp. 179–203.

[24] L. Modersohn, S. Schulz, C. Lohr, and U. Hahn (2022) GRASCCO – the first publicly shareable, multiply-alienated German clinical text corpus. In Ger. Med. Data Sci. 2022, R. Röhrig, N. Grabe, V. S. Hoffmann, U. Hübner, J. König, U. Sax, B. Schreiweis, and M. Sedlmayr (Eds.), Studies in Health Technology and Informatics, Vol. 296, pp. 66–72.

[25] N. Moore, P. Blin, R. Lassalle, N. Thurin, P. Bosco-Levy, and C. Droz (2021) National health insurance claims database in France (SNIRAM), système national des données de santé (SNDS) and Health Data Hub (HDH). In Databases for Pharmacoepidemiological Research, pp. 131–140.

[26] I. Neamatullah, M. M. Douglass, L. H. Lehman, A. Reisner, M. Villarroel, W. J. Long, P. Szolovits, G. B. Moody, R. G. Mark, and G. D.
Clifford (2008-07) Automated de-identification of free-text medical records. BMC Med. Inform. Decis. Mak. 8. External Links: Document, ISSN 1472-6947, Link Cited by: §1.1. [27] A. NĂ©vĂ©ol, H. Dalianis, S. Velupillai, G. Savova, and P. Zweigenbaum (2018) Clinical natural language processing in languages other than English: opportunities and challenges. J. Biomed. Semant. 9 (1), pp. 12. Cited by: §1.1. [28] M. Neves, A. J. Yepes, A. Siu, R. Roller, P. Thomas, M. V. Navarro, L. Yeganova, D. Wiemann, G. M. Di Nunzio, F. Vezzani, et al. (2022) Findings of the WMT 2022 biomedical translation shared task: monolingual clinical case reports. In Proc. 7th Conf. Mach. Transl. (WMT), pp. 694â723. Cited by: §1.1. [29] J. P. Pestian, C. Brew, P. Matykiewicz, D. Hovermale, N. Johnson, K. B. Cohen, and W. Duch (2007-06) A shared task involving multi-label classification of clinical free text. In Biol. Transl. Clin. Lang. Process., K. B. Cohen, D. Demner-Fushman, C. Friedman, L. Hirschman, and J. Pestian (Eds.), Prague, Czech Republic, pp. 97â104. External Links: Link Cited by: §1.1. [30] M. Saeed, M. Villarroel, A. T. Reisner, G. Clifford, L. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark (2011-05) Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Crit. Care Med. 39, pp. 952â960. Note: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3124312/ Cited by: §1.1, §1.1. [31] J. Sinclair (2004) Corpus and text: basic priniciples. In Developing Linguistic Corpora: a Guide to Good Practice, M. Wynne (Ed.), External Links: ISSN 1463 5194 Cited by: §1.2. [32] L. Sweeney (2002-10) K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10 (5), pp. 557â570. External Links: ISSN 0218-4885, Link, Document Cited by: §2.3.1. [33] X. Tannier, P. WajsbĂŒrt, A. Calliger, B. Dura, A. Mouchet, M. Hilka, and R. 
Bey (2024) Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse. Methods Inf. Med. 63 (01/02), pp. 021â034. Cited by: §1.1. [34] Ă. Uzuner, Y. Luo, and P. Szolovits (2007) Evaluating the state-of-the-art in automatic de-identification. J. Am. Med. Inform. Assoc. 14, pp. 550â563. External Links: Document Cited by: §1.1, §1.1. [35] T. Vakili, A. Henriksson, and H. Dalianis (2025-07) Data-constrained synthesis of training data for de-identification. In Proc. 63rd Annu. Meet. Assoc. Comput. Linguist. (ACL), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 27414â27427. External Links: Link, Document, ISBN 979-8-89176-251-0 Cited by: §1.1. [36] J. Zaghir, M. Bjelogrlic, J. Goldman, S. Aananou, C. Gaudet-Blavignac, and C. Lovis (2024-05) FRASIMED: a clinical French annotated resource produced through crosslingual BERT-based annotation projection. In Proc. LREC-COLING 2024, N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue (Eds.), Torino, Italia, pp. 7450â7460. External Links: Link Cited by: §1.1. [37] P. Zweigenbaum, P. Jacquemart, N. Grabar, and B. Habert (2001) Building a text corpus for representing the variety of medical language. Stud Health Technol Inform 84 (Pt 1), pp. 290â294. Cited by: §1.2. 8 Author Contributions Xavier Tannier: Conceptualization, Methodology, Software, Validation, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Funding acquisition. Salam Abbara: Conceptualization, Methodology, Writing - Review & Editing, Funding acquisition. RĂ©mi Flicoteaux: Conceptualization, Methodology, Validation, Writing - Review & Editing. Youness Khalil: Methodology, Writing - Review & Editing, Project administration. AurĂ©lie NĂ©vĂ©ol: Conceptualization, Writing - Review & Editing, Funding acquisition. Pierre Zweigenbaum: Conceptualization, Writing - Review & Editing, Funding acquisition. 
Emmanuel Bacry: Conceptualization, Supervision, Project coordinator, Funding acquisition, Writing - Review & Editing. 9 Competing Interests The authors declare no competing interests related to this work. 10 Acknowledgements We thank the authors of the reports for their contribution and feedback on the protocol, as well as the PARTAGES consortium members for fruitful discussions towards corpus development. We also thank Florian Pons for helping with operational support and project coordination. 11 Funding This work was carried out as part of the PARTAGES project, awardee of the Bpifrance France 2030 call for proposals âDigital Commons for Generative Artificial Intelligence.â 12 Ethics statement* Through their affiliations with French public service agencies, the developers of the PARHAF corpus have benefited from access to SNDS data. The clinical scenarios used to write the clinical documents in the PARHAF corpus are based on aggregated public health statistics and do not pertain to identifiable real patients. No private information about individual subjects was used in this study; therefore, no IRB or ethics approval was required to create or distribute the PARHAF corpus. The clinical document authors involved in this study were apprised of the full document creation protocol. Participation was voluntary, and authors were compensated for their work in accordance with French labor laws. The PARHAF corpus is intended for use as educational material and as support for the development and evaluation of clinical NLP systems. It is not intended for clinical use. Supplementary Material Figure S1 shows the changes in distribution for each specialty, after the adjustment described in Section 2.3.1. Figure S2 illustrates the number of patients written by each author.
|
Scooped by
Gilbert C FAURE
March 29, 3:50 AM
|
98% of people don't fail with AI…
They're just stuck on beginner tools forever.
And then wonder why results stay average.
When ChatGPT and other AI tools launched towards the end of 2022, I was doing everything the "easy way":
→ I used basic prompts
→ Used generic tools
→ Got average results
But it felt productive.
But now I do 10x more work… in half the time. With a better tool stack.
That's when it clicked: AI rewards leverage.
And the gap between novice and expert level AI tools is where that leverage lives.
Here's what changed everything for me:
• Beginners use AI to assist
• Experts use AI to do the work
• Beginners ask questions
• Experts build workflows and agents
• Beginners chase outputs
• Experts chase systems
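The "workflows, not one-off questions" contrast can be made concrete. Below is a minimal sketch in Python; `ask_model` is a hypothetical stand-in for any LLM API call (not a real library function), used only to show the structural difference between asking once and chaining reusable steps:

```python
# Hypothetical stand-in for an LLM API call; a real implementation
# would call a provider's SDK here.
def ask_model(prompt: str) -> str:
    return f"[model answer to: {prompt[:40]}]"

# Beginner pattern: one ad-hoc question, answer used as-is.
def one_off(question: str) -> str:
    return ask_model(question)

# Expert pattern: a repeatable workflow chaining draft -> critique -> revision.
def workflow(task: str) -> str:
    draft = ask_model(f"Draft a response for: {task}")
    critique = ask_model(f"List weaknesses of this draft: {draft}")
    return ask_model(f"Revise the draft {draft} using critique: {critique}")

print(one_off("Summarize our Q3 report"))
print(workflow("Summarize our Q3 report"))
```

The point of the second function is not the stub itself but the shape: each step's output feeds the next, so the whole chain can be rerun on any new task without rewriting prompts by hand.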
Now look at how the shift actually plays out:
→ Presentations: Gamma creates beautiful presentations
→ Data: Rows for structured analysis
→ Research: ChatGPT Deep Research for depth
→ Learning: NotebookLM for different learning formats
→ Video: VEED for end-to-end video content editing
→ Coding: Claude Code for precision
→ Websites: Webflow for aesthetic websites
→ Apps: Replit to build anything
The difference is the level of thinking behind the tool.
Do's for expert-level AI usage:
✅ Build repeatable workflows
✅ Combine multiple tools into one pipeline
✅ Use AI for execution
✅ Validate outputs with real-world data
✅ Continuously upgrade your stack
Don'ts for expert-level AI usage:
❌ Don't rely on a single tool for everything
❌ Don't skip context
❌ Don't blindly trust outputs
❌ Don't ignore speed vs quality tradeoffs
❌ Don't stay comfortable with beginner setups
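Two of the do's above, combining tools into one pipeline and validating outputs instead of blindly trusting them, can be sketched in a few lines. The function names (`extract_numbers`, `summarize`, `validate`) are illustrative stand-ins for whatever tools sit at each stage, not real product APIs:

```python
import re

def extract_numbers(text: str) -> list[float]:
    # Stage 1: extraction (stand-in for an AI extraction tool).
    return [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", text)]

def summarize(numbers: list[float]) -> dict:
    # Stage 2: analysis (stand-in for a data tool such as Rows).
    return {"count": len(numbers), "total": sum(numbers)}

def validate(summary: dict) -> dict:
    # Validation gate: never blindly trust an upstream output.
    if summary["count"] == 0:
        raise ValueError("pipeline produced no data; check the input")
    return summary

report = validate(summarize(extract_numbers("Sales were 120 in March and 95.5 in April")))
print(report)  # {'count': 2, 'total': 215.5}
```

The design point is that the stages compose: each tool's output is the next tool's input, and the validation step sits inside the pipeline rather than being left to the user at the end.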
Most people get stuck because beginner tools feel "good enough". But "good enough" is the enemy of scale.
If you want an unfair advantage, you need better tools. Used like an expert.
Check the infographic 👇 I've covered all the expert-level tools clearly.
And now I'm curious: What do you think about using expert-level AI tools? Are they overkill or the real edge? Comment below 👇
➕ Follow for more breakdowns like this.
♻️ Repost to help your network use expert-level AI tools
|