|
Scooped by
Gilbert C FAURE
October 13, 2013 8:40 AM
|
This is a personal notebook. Thanks to John Dudley for the following tweet: "If you like interesting snippets on all sorts of subjects relevant to academia, information, the world, highly recommended is @grip54's collection: La curation de contenus, la mémoire partagée d'une veille scientifique et sociétale"
|
Scooped by
Gilbert C FAURE
Today, 11:28 AM
|
Earth at Night, illuminated by the full moon: a dark sphere which, with editing or a longer exposure, turns pale blue, showing bright spots that are cities at night across the Iberian Peninsula, Africa's Mediterranean coastline, and South America on the right.
Dark shot: https://lnkd.in/d6gmWgiF
Colorful shot https://lnkd.in/dr_4iEEm
#earthatnight #artemisII
|
Scooped by
Gilbert C FAURE
April 3, 11:38 AM
|
Why Choosing Between Action Research and Case Study Can Make or Break Your Thesis.
Many graduate students weaken their thesis by confusing action research with case study, yet the two serve fundamentally different academic purposes.
Action research is initiated to solve an immediate problem. It focuses on implementing solutions, often within the field of education, where researchers may also act as participants in the research process. This approach is practical, intervention-based, and solution-oriented.
Case study, by contrast, involves in-depth analysis of a particular event or case over a long period of time. It emphasizes observing and analysing a situation, is used in many fields, and does not set out to solve a problem. Researchers typically do not take part in the research setting.
Misunderstanding this distinction leads to flawed methodology, weak research design, and inconsistent findings, all common issues in rejected proposals.
📲 If you need thesis help, WhatsApp DocAdeson on +14243487554
♻️ Found this useful? Follow, like, repost, and comment.
#DrAdeson #AcademicResearch #ResearchMatters #ResearchCommunity #AcademicWriting #PhDLife #PostdocLife #GradSchool
|
Scooped by
Gilbert C FAURE
April 3, 11:29 AM
|
LLMs handle medical diagnosis well, but public-facing AI chatbot use does not. On their own, the LLMs completed the scenarios accurately, identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group.
https://lnkd.in/gRCrBSkE
#LLM #AI
|
Scooped by
Gilbert C FAURE
April 2, 1:18 PM
|
Reddit wants to scan your eye before letting you write a comment. It sounds like science fiction. But no: it is an official announcement dated March 25, 2026.
Reddit has a problem. Millions of fake automated accounts are flooding the platform: programs that post, comment, and like in place of real users. Reddit deletes 100,000 of them a day, and it is no longer enough. Digg, its former competitor, has just shut down, overwhelmed by the machines.
The solution? Ask suspicious accounts to prove there is a human behind them, using Face ID, passkeys, or even World ID, a system that scans your iris.
The detail worth knowing: World ID is a project co-founded by Sam Altman, the CEO of OpenAI. The same Sam Altman who invested more than 60 million dollars in Reddit, sat on its board for seven years, and owns more shares than the platform's CEO. His stake is worth more than a billion dollars. A slight conflict of interest.
So here we are, caught between two risks: letting the open internet die under fake accounts, or saving it by handing our biometric data to the owner of ChatGPT.
I don't have the solution. But if the only way to prove you are human is to give up your anonymity, then we have a serious design problem.
|
Scooped by
Gilbert C FAURE
April 1, 10:39 AM
|
Not all evidence is created equal—and in public health, that distinction saves lives.
Understanding clinical study designs is the foundation of evidence-based decision-making:
1. Observational studies (cohort, case-control, cross-sectional) help us detect patterns, associations, and disease burden, critical for surveillance and hypothesis generation.
2. Experimental studies (randomized vs non-randomized) go a step further, establishing causality through controlled intervention.
But here’s the nuance many overlook:
✔️ Cohort studies track exposure → outcome (powerful for incidence & risk)
✔️ Case-control studies work backward (efficient for rare diseases)
✔️ Cross-sectional studies provide snapshots (essential for prevalence)
✔️ Randomized trials minimize bias and remain the gold standard, but are not always feasible in real-world public health settings
The real expertise lies not in choosing the “best” design—but in choosing the right design for the right question.
In an era of data abundance and rapid policy decisions, strengthening our understanding of study designs is not optional—it is a professional responsibility.
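The contrast between the cohort and case-control designs above comes down to which measure the same 2×2 table yields: a risk ratio (following exposure forward to incidence) or an odds ratio (working backward from cases). A minimal sketch with purely hypothetical counts:

```python
# Illustrative only: hypothetical counts showing how the same 2x2 table
# yields a risk ratio (cohort logic) or an odds ratio (case-control logic).
def risk_ratio(exp_cases, exp_total, unexp_cases, unexp_total):
    """Cohort view: compare incidence between exposed and unexposed."""
    return (exp_cases / exp_total) / (unexp_cases / unexp_total)

def odds_ratio(a, b, c, d):
    """Case-control view: a=exposed cases, b=exposed controls,
    c=unexposed cases, d=unexposed controls."""
    return (a * d) / (b * c)

# Hypothetical cohort: 30/1000 exposed vs 10/1000 unexposed develop disease.
rr = risk_ratio(30, 1000, 10, 1000)
# The same data recast as a case-control 2x2 table.
orr = odds_ratio(30, 970, 10, 990)
print(round(rr, 2), round(orr, 2))  # 3.0 3.06
```

For a rare outcome like this one, the odds ratio closely approximates the risk ratio, which is why case-control studies are informative despite sampling backward.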
#Epidemiology #PublicHealth #EvidenceBasedPractice #ClinicalResearch #DataScience #GlobalHealth #ResearchMethods
|
Scooped by
Gilbert C FAURE
April 1, 6:12 AM
|
Introducing my new white paper: The myth of the academic superstar - or why name disambiguation is crucial
|
Scooped by
Gilbert C FAURE
April 1, 3:59 AM
|
What should your organization know about the FedNow Service, an instant payment infrastructure developed by the Federal Reserve?
These resources from Federal Reserve Financial Services are a great starting point. Get up to speed on the basics about these innovative payments, how they work and the benefits they offer: https://bit.ly/47zdgnJ
|
Scooped by
Gilbert C FAURE
April 1, 3:57 AM
|
More than 9,000 researchers published at least 72 papers in a single year - more than one paper every five days - in one or more of the years between 2019 and 2024.
When a “more conservative threshold” of 40 papers a year was applied, the number of so-called "hyper-prolific authors" increased by 66 per cent, from 2,517 in 2019 to 4,189 in 2023, a 2025 study found, against a wider increase in publications of 15 per cent over that period.
Last year, Clarivate excluded 432 authors from its latest Highly Cited Researchers list in response to concerns over “extreme levels of publication relative to field baselines”.
There is also a “growing trend of multiple institutional affiliations”, often across different countries, with “some authors listing affiliations with more than 20 institutions”.
Concerns abound that authors who publish on a weekly basis are cutting corners, corrupting authorship norms and overburdening the peer review system – with AI likely to make matters worse. But if incentives are misaligned, what can be done? And is the moral panic exaggerated? Jack Grove reports for Times Higher Education. https://lnkd.in/embsRNmZ
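As a quick sanity check on the quoted figures (72 papers a year, and the rise from 2,517 to 4,189 hyper-prolific authors):

```python
# Quick arithmetic check of the figures quoted above.
papers_per_year = 72
days_between = 365 / papers_per_year           # ~5.07: more than one paper every five days
increase = (4189 - 2517) / 2517 * 100          # percentage growth in hyper-prolific authors
print(round(days_between, 2), round(increase))  # 5.07 66
```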
|
Scooped by
Gilbert C FAURE
March 31, 6:26 AM
|
Javiera Atenas is a senior lecturer in the Faculty of Business, Arts, Social Sciences and Technology at the University of Suffolk, United Kingdom. She leads the postgraduate certificate in pedagogical practice and teaches data analysis and visualisation...
|
Scooped by
Gilbert C FAURE
March 31, 4:18 AM
|
Information that is captured but not shared is information lost. It is one of the most frequent blind spots I encounter in organizations: teams that monitor, signals that surface, and decision-makers who see only a tiny fraction of them.
Monitoring without a distribution channel is not monitoring. It is collecting.
Three questions to test your organization:
➡️ Who receives your monitoring briefs today?
➡️ In what form? How often?
➡️ Does it change anything in the decisions being made?
If you have no clear answer to these three questions, that is the starting point. A Fresque de la connaissance and a collaborative workshop to move decisively into action.
That is exactly what we will be working on: May 20 with those who monitor, and May 28 with those who decide.
👉 Registration link sent in reply to your comments or DMs.
#décision #veilleur #veille #compétences #fresque #collaboratif #pragmatique
|
Scooped by
Gilbert C FAURE
March 30, 8:29 AM
|
🎨 NotebookLM strikes again! Infographics with custom styles are now available: 10 preset styles (editorial, clay, kawaii…) plus the option to create your own with a simple prompt. Your documents turned into striking visuals in one click. 👉
|
Scooped by
Gilbert C FAURE
March 29, 4:06 AM
|
Which AI collects the most personal data?
According to Surfshark's study, Meta AI is the AI that collects the most personal data.
It covers 33 of 35 data types, almost everything it is possible to collect.
It is also one of the few to include sensitive data, such as financial information or certain personal data.
Behind it, other tools such as Gemini also collect sensitive data. ChatGPT has broadened its collection in recent months but remains below them.
Claude is among the more restrained.
The takeaway: your conversations are not trivial. They can be stored, analyzed, and sometimes used for other purposes.
A good reflex: avoid sharing sensitive information there.
|
Scooped by
Gilbert C FAURE
Today, 11:33 AM
|
Most people think a drug works simply because of what is inside it. In reality, how it enters your body can be just as important.
This is called the route of drug administration—oral, intravenous, intramuscular, subcutaneous, inhalational, topical—and each route changes how a drug behaves inside you.
Why does this matter for the general population?
Because the same drug can act very differently depending on how it is taken:
• A tablet may take 30–60 minutes to act, while an injection can work within seconds
• Some drugs are destroyed in the stomach and must never be taken orally
• Incorrect use (like crushing sustained-release tablets) can lead to toxicity
• Inhalers, if used incorrectly, may deliver almost no benefit despite regular use
In simple terms: right drug + wrong route = wrong outcome
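The tablet-versus-injection timing difference can be illustrated with a standard one-compartment model. The dose, volume, and rate constants below are made-up teaching values, not clinical parameters:

```python
import math

# Illustrative sketch (not clinical guidance): a one-compartment model with
# invented parameters, contrasting an IV bolus (peak at time zero) with oral
# dosing (first-order absorption delays the peak).
def iv_concentration(dose, vd, ke, t):
    """IV bolus: concentration peaks immediately, then declines exponentially."""
    return (dose / vd) * math.exp(-ke * t)

def oral_concentration(dose, f, vd, ka, ke, t):
    """Oral dosing (Bateman equation): absorption rate ka delays the peak."""
    return (f * dose * ka / (vd * (ka - ke))) * (math.exp(-ke * t) - math.exp(-ka * t))

# Hypothetical drug: dose 100 mg, Vd 50 L, ke 0.1/h, ka 1.0/h, bioavailability 0.8.
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, round(iv_concentration(100, 50, 0.1, t), 3),
          round(oral_concentration(100, 0.8, 50, 1.0, 0.1, t), 3))
```

At t = 0 the IV route already gives full plasma concentration while the oral route gives zero, which is exactly the "seconds versus 30–60 minutes" contrast above.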
As an MD trainee in Medical Pharmacology, this is not just academic knowledge for me—it is a responsibility.
We often see:
• Antibiotic misuse due to wrong administration practices
• Poor control of chronic diseases because of improper drug use
• Adverse drug reactions that could have been prevented with basic awareness
Educating people about how to take medicines correctly can:
• Improve treatment outcomes
• Reduce side effects
• Prevent drug resistance
• Empower patients to participate in their own care
Pharmacology is not just about drugs—it is about optimizing how those drugs interact with human biology.
And sometimes, the smallest detail—like the route of administration—makes the biggest difference.
#Medicine #Pharmacology #PatientEducation #RationalUseOfMedicines #Healthcare #MedicalEducation
|
Scooped by
Gilbert C FAURE
Today, 11:25 AM
|
Claude 4.5 has become a real work suite.
And if you are an executive, here is the essential takeaway:
✦ 1. Choose the right model
Opus 4.5 → strategy, complex reasoning, high-stakes topics
Sonnet 4.5 → drafting, summarizing, everyday business use
Haiku 4.5 → simple tasks, speed, high volumes
Don't ask the same thing of every model.
It is like using the same vehicle to deliver a parcel… or to cross a desert.
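The model-per-task split can be sketched as a tiny router. The task categories and the exact model identifier strings below are illustrative assumptions, not Anthropic's API:

```python
# Hypothetical routing sketch: the tier names come from the post above; the
# task categories and model identifier strings are assumptions for illustration.
ROUTES = {
    "strategy": "claude-opus-4-5",    # complex reasoning, high stakes
    "drafting": "claude-sonnet-4-5",  # everyday business writing
    "bulk": "claude-haiku-4-5",       # simple, high-volume tasks
}

def pick_model(task_type: str) -> str:
    """Return the model suited to the task; default to the mid-tier model."""
    return ROUTES.get(task_type, ROUTES["drafting"])

print(pick_model("strategy"))   # claude-opus-4-5
print(pick_model("newsletter")) # claude-sonnet-4-5 (fallback)
```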
✦ 2. Claude is not just for writing
It can also:
→ search the web
→ analyze files
→ work with Google Drive
→ manage Projects
→ code with Claude Code
→ produce deliverables with Artifacts
The question is no longer "write me some text".
The question is: move my work forward.
✦ 3. The most useful features for an executive
Artifacts → produce an action plan, an SOP, a proposal, a table
Web Search → monitoring, meeting preparation, competitive analysis
File analysis → understand your sales, leads, quotes, margins
Projects → build a working memory per topic: marketing, sales, recruiting, management
Claude Code → prototype an internal tool, accelerate a digital project, build an MVP
File uploads → hand over PDFs, contracts, meeting minutes, screenshots so it can analyze your actual situation
✦ 4. The best concrete use cases
→ prepare a sales meeting in 3 minutes
→ turn a meeting into decisions + tasks + priorities
→ analyze a sales or leads file
→ build a sales proposal faster
→ structure a 90-day plan
→ turn scattered documents into a clear system
✦ 5. The real method
The beginner says:
"Summarize this for me."
The advanced user says:
→ here is my objective
→ here are my files
→ here are my constraints
→ ask me the missing questions
→ propose a plan
→ execute
→ improve the V1
In short: they don't "prompt".
They manage Claude like a colleague.
That is the real upgrade.
---------------------------------
PS: I am running a masterclass on April 10 at 8 PM to help you make your Claude upgrade and learn to use it to free yourself from day-to-day operations, reduce your mental load, and accelerate your company's growth.
👉 Register here:
https://lnkd.in/eN6mBd9B
|
Scooped by
Gilbert C FAURE
April 3, 11:33 AM
|
PhD Students - How to check if your research idea is actually new?
First, let's understand why novelty is important for research.
Here is what reviewers will look for in your research:
1️⃣ Novelty → Is it new?
2️⃣ Significance → Is it important to anyone?
3️⃣ Methodology → Is it conducted the right way?
4️⃣ Verification → Can other researchers verify it?
5️⃣ Presentation → Is it presented in the right way?
You see, novelty comes at the top of this list.
To confirm novelty, meet PatSnap Eureka.
Eureka thinks like an IP expert.
Here is how it works.
1. Go to https://lnkd.in/dqiq55cM
2. Describe your research idea in 20-30 words
3. Eureka scans 200M+ patents to compare your idea
4. It shows you a side-by-side table of your idea vs existing ones
5. Export the entire novelty report to share with others
Why should you try it?
✓ Confirms the novelty of your research idea
✓ Gives you confidence in your research direction
✓ Lets you change course if your idea is not novel
✓ After confirmation, dive deep into your research
🎗️ Try Eureka for FREE: https://lnkd.in/dqiq55cM
❄️ Anything you'd like to add?
#phd #research
|
Scooped by
Gilbert C FAURE
April 2, 1:21 PM
|
AI has settled into everyday medical tools. Not by collective decision, but by gradual drift.
Here is our Top 10 of its uses in healthcare 📊
We wanted to take stock. Not of the promises, but of what actually exists: what is deployed, what is still in progress, and what remains to be built.
This carousel lists the 10 best-documented uses of AI for clinicians in 2025-2026. AI did not wait to be invited. It slipped into prescription software, electronic health records, documentation tools, alert systems. Touch by touch, integration by integration. And without clinician training anticipating the shift.
What the overview shows is a two-speed reality.
On one side, tools that keep their promises.
▪️ AI scribing halves documentation time.
▪️ Literature monitoring that used to take an hour now takes 15 minutes, provided you know how to do it properly with AI.
▪️ Clinical decision support, when coupled with medical judgment, produces better diagnoses than either alone.
The paradox is stark. The tools work. The gains are measurable. And yet 45% of clinicians never tell their patients that they use AI in their care, while 84% of French people would like to be told.
This is not bad faith. It is the absence of a framework. We automate tasks without teaching what that entails, then we deploy tools without preparing clinicians to evaluate what they delegate, to spot where the tool fails, and to keep their clinical judgment sovereign in the face of an algorithmic recommendation.
Training is not an accessory to deployment. It is its condition of validity.
That is exactly what Elliacare addresses. Not enthusiasm for the tools, but the competence to use them, and to know when not to trust them.
👉 Swipe the carousel to see the 10 uses, their maturity levels, and the figures behind them.
Of these 10 uses, which are you already doing, and which are you still missing?
#IAenSanté #Elliacare #MédecineAugmentée #FormationIA
|
Scooped by
Gilbert C FAURE
April 2, 1:13 PM
|
The number of women working as scientists and engineers in the EU reached 7.9 million in 2024, representing 40.5% of the scientists and engineers’ workforce across all economic activities. 🧑🔬🔬
Across EU regions, highest shares in:
🇪🇸 Canarias (58.8%) 🇵🇹 Região Autónoma dos Açores (57.3%) 🇵🇹 Madeira (56.4%)
Lowest in:
🇭🇺 Közép-Magyarország (30.0%) 🇫🇮 Manner-Suomi (30.7%) 🇮🇹 Sud (31.1%)
ℹ️ Please note that the map includes available regional data from EU countries, EFTA and candidate countries. The ranking in the caption of the post is based on data from EU countries only.
Learn more 👉 https://lnkd.in/eHQqWP_g
|
Scooped by
Gilbert C FAURE
April 1, 10:15 AM
|
|
Scooped by
Gilbert C FAURE
April 1, 4:46 AM
|
At Prisma Media, 40% of articles are generated by AI. Journalists are being digitally cloned to produce videos. Nobody told you. 📉
Le Monde has signed with Meta and OpenAI to feed its content into AI assistants. TF1 is experimenting with assisted production. AFP supplies the wire that AI agents summarize and redistribute. In the United States, 9% of articles are already partly written by AI, without saying so.
3,434 journalist positions were cut in 2025. In 2026, the pace is accelerating. The Washington Post, Politico, The Wall Street Journal: all hit.
What is dying in journalism:
👉 The routine article. Sports results, stock prices, weather: an AI agent writes that in 10 seconds. The Associated Press produces 730,000 automated articles a year. The journalist covering the factual beat is up against a machine that never sleeps.
👉 The outlet as sole intermediary. When readers put a question to an AI assistant that draws on Le Monde, AFP, and 200 other sources, they no longer need to visit the paper's site. Traffic falls. Advertising falls. The business model crumbles.
👉 Trust by default. 9% of articles partly AI-written without disclosure. Cloned journalists at Prisma. And only 12% of readers are comfortable with 100% AI content. The day readers no longer know who is writing, they tune out.
The survival test for Le Monde, TF1, Prisma Media, and AFP:
1️⃣ Bet on investigation, not the feed. The factual article is dying. Investigation, analysis, explanation: that is what AI cannot do. The journalist who survives is the one who goes into the field, not the one who rewords a wire dispatch.
2️⃣ Become the trusted source for AI agents. Le Monde has understood: if AI assistants cite your articles, you become the infrastructure of truth. The outlet that refuses to feed the LLMs disappears from the answers. The one that negotiates its place survives.
3️⃣ Commit to full transparency on AI use. Readers forgive AI. They do not forgive lying. Prisma Media produces 40% AI content? Fine. But say so. The outlet that plays it transparent earns trust.
My conviction: journalism will not die. But the journalist who produces content that AI does better, yes. What will remain are the investigators, the analysts, the editorial writers. The ones who think.
Series "La Chute des Géants — Saison 3" [13/15]. Yesterday: Clifford Chance, Gide, Bredin Prat. Tomorrow: HEC, ESSEC, INSEAD.
🚀 Executives: is your monitoring human or automated? 👉 https://lnkd.in/e6k46944
🎓 Consultants: AI communication is a new playing field. 👉 https://lnkd.in/eaJd3bZ8
🎯 Free masterclass on April 16: go from prompting to agentic AI in one hour. 👉 https://lnkd.in/eZGGrvvY
🚀 Our bootcamps run on Claude. Have an AI project? 👉 https://decisionia.com/rdv
You are reading AI-written articles without knowing it. Does that bother you? 👇
|
Scooped by
Gilbert C FAURE
April 1, 3:59 AM
|
Are boys in ‘crisis’ — and is the manosphere playing a part? My new feature for Nature Magazine looks at data on boys and young men, including education, health and attitudes. And it asks whether talk of a male crisis risks fueling hostility towards, or sidelining, women and girls. https://lnkd.in/eJGhkXuA
The data and interviews suggest that: - Globally, more boys than girls are out of school; young men are less likely to attend higher education.
- Injuries — from road accidents, violence, self-harm — are strikingly higher for male adolescents. More boys than girls die by suicide.
- Mental health disorders are a large and growing problem for boys and girls.
- Stereotypical ideas of masculinity are common e.g. that men must be tough, self-sufficient, financial providers and in control in relationships. In one survey, 63% of young men said they regularly engaged with a masculinity or men influencer. But research on the manosphere and its impact is still limited.
It’s uncomfortable & controversial to talk about ‘boys in crisis’ in the face of entrenched and worsening discrimination against girls and women. Many things are worse for adolescent girls.
The message I heard was: it's important to understand the challenges that all young people are facing. Many thanks to the researchers & experts who spoke to me about this important topic – one that I was particularly interested to report on as the mum of three boys.
|
Scooped by
Gilbert C FAURE
April 1, 3:56 AM
|
A few days ago, over lunch with some senior academics, I heard a really meaningful saying:
“學術不是忙出來的,而是閒出來的” (Science doesn't come from being constantly busy— it emerges from having periods of idleness.)
That instantly reminded me of this powerful insight on creativity in science from Max Perutz, a Nobel laureate: "Creativity in science, as in art, cannot be organized. It arises spontaneously from individual talent. Well-run laboratories can foster it, but hierarchical organizations, inflexible bureaucratic rules, and mountains of futile paperwork can kill it. Discoveries cannot be planned—they pop up, like Puck, in unexpected corners."
In a world full of endless meetings, grant deadlines, metrics, and "productivity" pressure, these two thoughts hit hard.
Real breakthroughs often come not from grinding harder, but from protecting unstructured time—time to think, wander, connect dots that no schedule could predict.
|
Scooped by
Gilbert C FAURE
March 31, 4:24 AM
|
Most people are still using AI like a chatbot.
That’s the biggest mistake in 2026. ⚠️
Because AI is evolving into something much bigger…
👉 Agentic AI— systems that don’t just respond but think, plan, and execute tasks on their own.
If you understand this, you’re already ahead of 99% 🚀
Here’s the simple breakdown of how modern AI works:
🔹 AI & ML Foundations – NLP, Deep Learning, Transformers – The core that powers everything
🔹 Gen AI Layer – Text, Image, Audio, Video generation – Prompt Engineering + RAG
🔹 AI Agents – Tool usage & automation – Memory + decision-making – Multi-step task execution
🔹 Agentic AI (Next Level) – Autonomous systems – Goal-based execution – Self-improving workflows
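The chatbot-versus-agent distinction above boils down to a loop: plan, act with a tool, observe, repeat until the goal is met. A toy sketch, not any specific framework; the tools and goal test are invented stand-ins:

```python
# Minimal sketch of the "agentic" loop described above: plan, act with a tool,
# observe, repeat. Everything here is a toy stand-in, not a real framework.
def run_agent(goal, tools, max_steps=5):
    state = {"goal": goal, "log": []}
    for _ in range(max_steps):
        # Plan: pick the first tool whose applicability condition matches state.
        action = next((t for t in tools if t["when"](state)), None)
        if action is None:
            break  # goal reached or nothing left to do
        # Act + observe: run the tool and fold its result back into state.
        state = action["do"](state)
        state["log"].append(action["name"])
    return state

# Toy workflow: "research" first, then "summarize", then stop on its own.
tools = [
    {"name": "research", "when": lambda s: "notes" not in s,
     "do": lambda s: {**s, "notes": "raw findings"}},
    {"name": "summarize", "when": lambda s: "notes" in s and "summary" not in s,
     "do": lambda s: {**s, "summary": "3 bullet points"}},
]
result = run_agent("brief the team", tools)
print(result["log"])  # ['research', 'summarize']
```

The point of the sketch is the shift from one prompt/one answer to a system that decides its own next step until the work is done.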
---
💡 In simple words: We are moving from asking AI → to assigning work to AI
⚡ If you learn this now: You won’t just use AI You’ll build systems that work for you 24/7
📌 Save this before it disappears
🔁 Repost to help others learn AI
👥 Tag someone who needs to see this
💬 What’s your take on Agentic AI?
🚀 Follow Harish Kumar for more AI insights
#AI #ArtificialIntelligence #GenAI #AIAgents #FutureOfWork #Automation #TechTrends #AI2026
|
Scooped by
Gilbert C FAURE
March 31, 4:13 AM
|
Some learning sticks because it’s clear. Some sticks because it’s repeated.
But some stays with us simply because we built it ourselves.
Our latest blog explores the IKEA Effect and why effort increases ownership. When people contribute, solve, or create, learning stops feeling like something delivered to them and starts feeling like something they own.
That shift matters.
Because effort changes the relationship people have with what they learn. It deepens engagement. It strengthens memory. And most importantly, it makes people far more likely to use it.
In learning design, the goal isn’t just to make things easy. It’s to make space for contribution.
When learners build, decide, and shape outcomes, even in small ways, the experience becomes personal. And personal learning is the kind that lasts.
📌 Write to elearning@learnnovators.com to craft learning that transforms behaviour.
#LearningDesign #LearningScience #WorkplaceLearning #InstructionalDesign
https://lnkd.in/eMnCi9Nx
|
Scooped by
Gilbert C FAURE
March 30, 3:42 AM
|
PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

Xavier Tannier (Sorbonne Université, Université Sorbonne Paris Nord, Inserm, Limics, F-75006 Paris, France); Salam Abbara (Université Paris-Saclay, UVSQ, Assistance Publique-Hôpitaux de Paris, Raymond Poincaré University Hospital, Infectious Disease Department, Garches, France; Yonsei University College of Medicine, Gangnam Severance Hospital, Department of Laboratory Medicine, Seoul, South Korea); Rémi Flicoteaux (Assistance Publique-Hôpitaux de Paris, Department of Medical Information, Paris, France); Youness Khalil (Health Data Hub, 75015, Paris, France); Aurélie Névéol (Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France); Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France); Emmanuel Bacry (Health Data Hub, 75015, Paris, France; Université Paris-Dauphine, PSL, CNRS, CEREMADE, 75016, Paris, France)

Abstract

The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7,394 clinical reports covering 5,009 patient cases across a wide range of medical and surgical specialties.
It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.

Corresponding author: Xavier Tannier, xavier.tannier@sorbonne-universite.fr

1 Background & Summary

1.1 Context and Motivation

Much of the information in electronic health records is conveyed by text such as clinical notes and discharge summaries (see, e.g., [17]). Natural language processing aims to unlock that information and make it available for downstream tasks. Publicly available clinical text corpora are a key asset to design, tune, and evaluate clinical natural language processing systems [9]. Sharing clinical text is, however, difficult: the tension between individual data privacy and corpus distributability has been widely acknowledged as the central obstacle to making clinical corpora publicly available [3].

Some resources have been made available for U.S. English clinical NLP over the past few decades [9], starting with the 2007 Computational Medicine Challenge [29] and the i2b2 series of clinical NLP shared tasks [34], many of which relied on the MIMIC database [30, 16]. However, access hurdles remain particularly salient in the French context. The European regulatory framework, among the most protective of health data, imposes severe restrictions on the circulation and secondary use of medical records. This creates a marked scarcity of open and usable corpora of French medical reports [27].
Beyond data access, models trained on clinical reports may themselves become sensitive, as they can memorize patient information during training, making the sharing of trained models legally and ethically challenging [2]. Together, these factors create a fragmented ecosystem in which institutions and research teams operate in isolation, unable to effectively pool data or models. This combination of restricted data access, model sensitivity, lack of open resources, and resulting fragmentation severely limits the development and robust evaluation of NLP systems applied to French clinical text.

This creates the following challenge: given this privacy bottleneck, how can a large, realistic, and fully privacy-preserving corpus of clinical reports be created to help clinical language processing research and development while being freely shareable?

To address this challenge, NLP researchers have studied de-identification methods that remove personally identifying information from original clinical text [34, 26, 11, 5, 4] and used them to de-identify clinical datasets such as that in MIMIC [26]. However, the resulting text is pseudonymized (directly identifying information has been removed) but not anonymized (there is no guarantee that reidentification is impossible). This prevents it from being freely distributed. In the United States, the MIMIC database can be shared under a stringent data-use agreement, but it remains unclear whether this protocol is compatible with E.U. regulation. For this reason, clinical document collections in French (e.g., the MERLOT corpus [3] and other corpora extracted from French clinical data warehouses [15, 33]) were used in evaluation studies with explicit targeted ethical board approval but could not be shared due to privacy restrictions. Hahn [12] notes that, faced with the non-shareability of real patient records, NLP researchers have developed a variety of proxies for clinical text.
One type of proxy is machine translation of English clinical datasets: Becker et al. [1] translated into German the ShARe/CLEF eHealth 2013 training dataset based on MIMIC-II data [30], Neves et al. [28] translated some clinical cases from English into French (with a focus on evaluating the performance of machine translation for measures and acronyms) and Frei et al. [7] translated into German the 2018 n2c2 shared task dataset that reused data from MIMIC-III [16]. However, translated text requires thorough human review, and cultural and health system differences make the resulting text noticeably different from native clinical text.

Another proxy is synthetic clinical text. GraSCCo [24] manually edited 63 deidentified German discharge summaries and case reports at multiple linguistic levels to make reidentification virtually impossible. Recent efforts have also explored the use of autoregressive generative language models to produce synthetic clinical documents in English [14], French [13], German [8], Swedish and Spanish [35]. Nonetheless, the balance between privacy and utility of the resulting material needs further analysis [21, 6].

Published case reports are a more distant proxy for clinical text, but their open-source status and existence in multiple languages have made them particularly attractive for clinical NLP. Case reports have been collected, for instance, in the following corpora: CAS [10] (French), E3C [20] (Italian, English, French, Spanish, and Basque), CANTEMIST [22] and DISTEMIST [23] (Spanish), and FRASIMED [36] (French translations of CANTEMIST and DISTEMIST). The style of case reports, however, is quite different from that of electronic health records.

The closest proxy for true clinical texts is those written by health care professionals about fictitious patients, for instance, in medical textbooks or course material. The JSynCC corpus [19] extracted 400 operative reports and 470 case reports from such textbooks.
The copyright on the source textbooks, however, prevents free distribution of the corpus. The PARROT corpus [18] contains 2,658 radiology reports about fictitious patients, including 475 in French, written on a volunteer basis by healthcare professionals from 21 countries. This endeavor was made possible through human networking, including leveraging professional radiological societies, an approach that may be difficult to scale up to a diversity of medical specialties.

1.2 Objectives and Contributions

To overcome this limitation, the approach adopted in the present work was to ask healthcare professionals to write new clinical reports describing fictitious patients specifically for the creation of a shareable corpus, and to distribute these reports under an open license. Because the reports are created for this purpose and do not derive from real patient data, they are anonymous and shareable by design. However, this approach raises an important methodological question: how can such reports be generated in a way that ensures both medical realism and statistical representativeness while preserving privacy? To address this challenge, we designed a corpus creation protocol that leverages clinicians' expertise while being guided by public health statistics, principles of corpus development [31, 37], and a set of predefined clinical scenarios. The protocol was implemented with a large pool of French-speaking clinicians through a partnership with associations and unions of medical residents across multiple medical and surgical specialties, which recruited 104 residents as report authors. Guidelines were developed for selecting clinical cases, using data from the French National Health Data System (SNDS [25]) as reference scenarios for report creation, and the residents authored synthetic medical reports following these guidelines. The resulting open-source French-language corpus can now be used to train and evaluate language models on targeted medical use cases.
In this article, we introduce this open-source corpus of French clinical documents. PARHAF comprises 7,394 expert-authored clinical reports describing 5,009 realistic yet fictitious patient cases. Each case is accompanied by structured documentation of the underlying clinical scenario, including the primary diagnosis, main procedure, care pathway, and discharge information when applicable. We further provide three specialized subsets specifically designed to support information extraction tasks in oncology and infectious diseases. This corpus offers a valuable resource for the development and evaluation of clinical NLP models, directly tackling the root cause of all the challenges outlined above: the inherently sensitive nature of clinical data. We release 6,185 documents corresponding to 4,254 fictitious patients under an open-source CC-BY license. The remaining portion of the corpus is temporarily embargoed to enable future evaluations under controlled conditions, thereby limiting the risk of large language model contamination through prior exposure to the data.

1.3 Intended uses of PARHAF

This corpus is intended for research, development, and educational purposes in clinical natural language processing. It enables the sharing of clinical-style notes and annotations and supports community-wide pooling of efforts around a common, openly accessible resource. The corpus is suitable for benchmarking French medical language models, including large language models, and for conducting reproducible clinical NLP research under controlled and privacy-safe conditions. The corpus also supports medical teaching, such as training medical students and residents in structured clinical report writing, diagnostic reasoning, and clinical information synthesis. It can also serve as a resource for clinical case preparation and supports training in clinical natural language processing on realistic yet fictitious reports without exposure to sensitive patient data.
The corpus further enables privacy-preserving data augmentation, either as a standalone resource or as a complement to restricted-access clinical datasets, provided its fictitious nature is explicitly acknowledged. Finally, the representativeness of part of the corpus is geared towards three use cases of the PARTAGES project, allowing methodological comparisons across these specific use cases.

1.4 Limitations and non-intended uses

Although efforts were made to create a diverse corpus that includes a variety of document types and clinical specialties, the corpus does not cover all specialties and variations of French clinical text. This corpus is intended for research purposes only, specifically for training and evaluating natural language processing models on French clinical text. It is not a substitute for clinically validated data and must not be used to support regulatory approval, clinical certification, or deployment decisions in real healthcare settings. It is not suitable for clinical use: it cannot be used for clinical decision-making, diagnosis, prognosis, treatment, or patient care. Models trained or evaluated on this data are not clinically validated, and results obtained on this corpus cannot be presented as evidence of clinical performance or safety. The corpus does not support generalization claims about real hospitals, regions, or clinical practices, nor does it allow epidemiological or population-level inference, as its distributions do not reflect real-world prevalence. It is also unsuitable for longitudinal studies or for assessing real-world clinical risk or safety, including rare adverse events or edge cases, and must not be used as a replacement for real clinical data in deployment settings. Finally, the corpus does not capture the operational constraints of real clinical environments (e.g., time pressure, workload, interruptions) and should not be used for stress-testing models under realistic clinical conditions.
2 Methods

2.1 Challenges

Building the PARHAF corpus required addressing two main challenges. The first was ensuring that recruited physicians and collected texts adequately represent the relevant dimensions of clinical language, while remaining within concrete implementation constraints (a limited author pool, a fixed corpus size, and the specific use cases targeted by the project). The second was encouraging healthcare professionals to write reports that closely resemble real clinical documents, while minimizing the risk of privacy leaks.

2.2 Clinical Scenario Design

For these reasons, we deemed it essential to provide relatively precise guidelines to assist healthcare professionals in authoring clinical reports that closely resemble real-world documents while minimizing the risk of privacy breaches. These guidelines addressed both the content and the format of the documents. Given that the recruited physicians primarily worked in hospital settings, the resulting corpus focused predominantly on hospital-based clinical situations.

2.2.1 Content Development

The selection of clinical scenarios was guided by our goal of guaranteeing the representativeness of the clinical situations actually observed in French hospitals (see Section 2.3.1) and by the constraint of ensuring physicians were familiar with the clinical situation in relation to their specialty of practice or training. Both aspects were addressed using hospitalization claims data available in the French National Health Data System (SNDS [25]). Scenarios were constructed by sampling observed distributions of Diagnosis-Related Groups (DRGs), principal diagnoses (ICD-10), age, sex, type of management (e.g., ambulatory surgery), and admission and discharge modes (e.g., emergency department admission). DRGs were used (in a less formal format) to describe the type of hospitalization (e.g., surgery, medicine) and to map clinical cases to physicians' qualifications (specialties).
Secondary diagnoses were incorporated into the scenarios as a list of 10 diagnoses randomly selected from the pool of diagnoses frequently associated with the primary diagnosis-DRG pair. Patient names were randomly assigned. Based on these core elements, authors were encouraged to develop the clinical case details, enriching the content with relevant and realistic information, maintaining medical consistency with the baseline information, and ensuring depth and authenticity while adhering to principles of plausibility and ethics.

2.2.2 Document Format

For document format, we aligned official recommendations with physicians' actual practices to develop specialized templates for each type of hospitalization:

• Medical hospitalization
– Hospital discharge summary

• Surgical hospitalization
– Pre-operative consultation report (for scheduled admissions)
– Operative report
– Hospital discharge summary (for ambulatory surgery, a single document was requested combining both the operative report and the discharge summary)

• Obstetrics (childbirth)
– Pre-delivery hospitalization report (for high-risk pregnancies) or emergency department visit report (for low-risk pregnancies)
– Delivery room report
– Postpartum hospitalization report (maternity ward)

• Oncology
– Pathology report

For discharge summaries, the template included: department name, reason for admission, medical history, surgical history, family history, allergies, lifestyle factors, treatment at admission, history of the present illness, clinical examination, complementary investigations, in-hospital course, discharge treatment, and conclusion. Similar minimal templates were developed for surgical, obstetric, and pathology reports. Authors were encouraged to follow these structures or to write in free-text format, provided that all required information was included.
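As a concrete illustration, the minimal discharge-summary template above amounts to a checklist of required sections. The sketch below is illustrative only: the section labels are given in English rather than the French headings actually used in the corpus, and the substring matching is deliberately naive.

```python
# Required sections of the discharge-summary template (labels from the
# guidelines, translated to English for illustration).
REQUIRED_SECTIONS = [
    "department name", "reason for admission", "medical history",
    "surgical history", "family history", "allergies", "lifestyle factors",
    "treatment at admission", "history of the present illness",
    "clinical examination", "complementary investigations",
    "in-hospital course", "discharge treatment", "conclusion",
]

def missing_sections(report_text: str) -> list[str]:
    """Return the template sections not mentioned in a free-text report.

    A naive case-insensitive substring check; a real validator would
    match the French headings and tolerate formatting variation.
    """
    lower = report_text.lower()
    return [s for s in REQUIRED_SECTIONS if s not in lower]
```

A validator built on this idea could flag incomplete free-text reports before human review, in line with the rule that free-text format was allowed as long as all required information was present.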
Finally, a structured summary section was completed at the end of each report, in which the authors specified the primary diagnosis, length of stay, and associated diagnoses mentioned in the report. The use of generative artificial intelligence tools was discouraged because it could bias both the content and the stylistic features of the reports.

2.3 Document Type and Distribution Strategy

This corpus is structured in two complementary components, targeting a total of 5,000 patients. The primary component (n = 3,900) includes patients across a wide range of medical specialties and is designed to maximize diversity and approximate representativeness, although the target size does not allow full coverage of the spectrum of possible clinical cases. The secondary component focuses on specific use cases (ICD-10 coding, oncology, and infectious diseases) and comprises patients selected outside the main distribution to support more targeted evaluation scenarios.

2.3.1 Core distribution

To approximate real-world distributions of medical activity, we relied on diagnosis frequencies derived from the SNDS [25], which provides exhaustive, nationwide hospital claims data and served as a proxy for the underlying epidemiological and care distribution across medical conditions. For the year 2024, the national claims database comprised approximately 18 million hospitalizations drawn from the SNDS. From these data, we defined clinical cases as the association of a DRG, sex, age group, and length-of-stay group. With these associations, we created a sampling database of 100,000 distinct clinical cases, covering around 4,000 distinct ICD-10 primary diagnoses. To ensure patient privacy and data confidentiality, the sampling strategy over this clinical case distribution adheres to the principles of k-anonymity and l-diversity [32].
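A minimal sketch of such a privacy-aware sampling setup is given below, combining a k-anonymity filter over case profiles with the capped square-root weighting detailed in the next paragraph. The values of k and p_max here are illustrative stand-ins, not the project's actual parameters.

```python
import math
from collections import Counter

def build_sampling_db(stays, k=10):
    """Keep only case profiles (DRG, sex, age group, LOS group) shared by
    at least k real hospital stays, in the spirit of k-anonymity: no
    sampled scenario can be traced back to a near-unique real patient.
    The threshold k=10 is illustrative.
    """
    counts = Counter(stays)
    return {case: n for case, n in counts.items() if n >= k}

def sampling_probabilities(case_counts, p_max=0.001):
    """Square-root transform of relative frequencies, capped at p_max,
    then renormalized: p_i = min(p_max, sqrt(f_i)) / sum_j min(p_max, sqrt(f_j)).
    Assumes f_i are relative frequencies (counts divided by the total).
    """
    total = sum(case_counts.values())
    capped = {c: min(p_max, math.sqrt(n / total)) for c, n in case_counts.items()}
    norm = sum(capped.values())
    return {c: v / norm for c, v in capped.items()}
```

The square-root transform flattens the head of the frequency distribution, and the cap keeps any single very common condition from dominating the sample.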
To preserve epidemiological realism while avoiding excessive over-representation of very frequent conditions, which would reduce clinical diversity in the corpus, we applied a square-root transformation to the empirical frequencies, yielding a preliminary sampling probability proportional to $\sqrt{f_i}$, where $f_i$ is the frequency of condition $i$ in the SNDS data. To further limit the dominance of the most common conditions, we capped this value at a maximum probability $p_{\max}$ (corresponding to a 0.1% sampling chance) and renormalized, giving the final sampling probability for each condition:

$$p_i = \frac{\min(p_{\max},\, \sqrt{f_i})}{\sum_j \min(p_{\max},\, \sqrt{f_j})}$$

In practice, this theoretical distribution required iterative adjustment to account for operational constraints. Because hired authors had uneven expertise across medical specialties, document production could not be distributed uniformly, and not all specialties could be covered. The final allocation therefore used the square-root-with-cap model as a guiding principle, with reallocation based on actual case availability and author capacity, while preserving broad clinical coverage. Figures S1 and S2 in the Supplementary Materials provide, respectively, a detailed breakdown of these adjustments by specialty and the number of cases written per author.

2.3.2 Specific Use Cases

In addition to the documents from the initial distribution, four specific sets of reports were assembled. Each set was designed to address a specific clinical information extraction use case:

Coding. Surgery reports from digestive surgery, orthopedic surgery, traumatology, and urology were specifically collected for a use case on ICD-10 diagnostic coding.
Identifying biomarkers in oncology. Pathology reports containing descriptions of tumor biomarkers used to inform diagnosis, prognosis, and targeted therapy selection in oncology: tissue and genomic alterations such as mutations, amplifications, and protein expression. The use case for this dataset is the automatic identification of these biomarkers.

Identifying the response to treatment in oncology. Oncology consultation reports mentioning treatment response (complete response, partial response, stable disease, progressive disease, not applicable, or indeterminate). The associated use case aims to classify RECIST-style information from these reports.

Infectiology. Reports describing infectious episodes (including bacteremia) along with the causative bacteria and the primary site of infection.

Other use cases (pseudonymization and summarization) are planned for the PARTAGES project (described below). However, the documents dedicated to these tasks do not require a distribution that differs from that described in the previous section.

2.4 Implementation

The PARTAGES project, funded by the French government under the France 2030 initiative (operated by Bpifrance), brings together a consortium of 32 partners, including research teams, public and private healthcare institutions, and AI-focused deeptech companies. Its aim is to develop open resources to support the emergence of generative AI solutions in healthcare. The creation of the PARHAF corpus of clinical reports is one of the consortium's initiatives. The corpus development was initiated through a scoping phase involving NLP experts and physicians, aimed at balancing production volume with budgetary constraints. Writing time was estimated at 60 minutes per document: 45 minutes for the first author and 15 minutes for review, validation, and correction by a second expert.
This estimation was based on the expertise of the physicians involved and on consultation with twelve residents from different specialties who were active within their respective residents' associations. A maximum duration of 60 minutes per document was adopted, with a gross hourly compensation of €40, corresponding to a maximum of €40 per completed and reviewed report. Recruitment was conducted through a temporary employment agency. A national outreach campaign targeted 21 residents' associations across different specialties in France. Eleven associations disseminated the call for participation, representing the following specialties: internal medicine, infectious diseases, visceral surgery, obstetrics and gynecology, neurology, pulmonology, public health, urology, oncology, anesthesiology and intensive care, and anatomical pathology. The call for participation was further circulated via the project's hospital partners and their associated medical networks, enabling the inclusion of residents from additional specialties: nephrology, hematology, orthopedics, pediatrics, gastroenterology and hepatology, and cardiology. More than 500 applications were received, reflecting strong engagement from the medical community. A final panel of 104 residents was selected, prioritizing residents in the later stages of training and ensuring broad geographic representation. This response confirmed residents' commitment to developing digital commons and supporting generative AI projects in healthcare. To ensure consistency and quality of the outputs, a structured support framework was implemented, including regular webinars, methodological guidelines, a centralized communication platform, and dedicated support. Contributors were also involved in methodological refinements through specialty-specific meetings, enabling adaptation of instructions to clinical practice and ensuring the validity and representativeness of the corpus.
From a financial perspective, operational costs primarily consisted of physician compensation, plus the management fees of the temporary employment agency (a multiplicative coefficient of 1.9 applied to the gross remuneration). The production of the 7,394 clinical reports, totaling 5,518 hours of effective work, resulted in a total operational cost of approximately €495,000.

3 Data Records

PARHAF consists of a single JSON file containing structured metadata about fictitious patients and the clinical documents associated with them. Each entry in the data array corresponds to one patient record and includes patient-level metadata, contextual information about the care scenario, and a list of associated documents. The documents themselves are not embedded in the JSON file. Instead, each document is referenced via a relative file path pointing to an external text file. These text files are stored separately and organized by medical specialty, with one directory per specialty. Each document file contains raw, unannotated plain text in French, with no markup, labels, or structural tags applied. The JSON file therefore acts as the index and metadata layer of the corpus, while the directory structure contains the raw textual content. The linkage between metadata and text relies exclusively on the relative paths specified in the documents[].path fields. The high-level structure of the JSON format and the path-based schema are described in Figure 1 and Table 1, respectively. In addition to this standalone dataset, we also distribute a Hugging Face dataset (Parquet/Arrow) that is a derived representation generated automatically from the JSON files. Both formats therefore contain identical information and differ only in storage layout. The PARHAF corpus is openly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and the Etalab 2.0 license. It was released on March 25, 2026.
The primary distribution is on Hugging Face (https://huggingface.co/datasets/HealthDataHub/PARHAF).

4 Data Overview

A subset of the dataset is temporarily embargoed to enable future evaluations under controlled conditions, thereby limiting the risk of large language model contamination through prior exposure to the data. In total, we release 4,254 patients (6,185 documents) and keep 755 patients (1,209 documents) under embargo for future release. Figures 2, 3, and 4 provide details about, respectively, the patient count per medical specialty, the patient population pyramid, and the document and word counts by document type.

5 Technical Validation

All contributors involved in report writing and validation completed standardized online training covering the study objectives, authoring guidelines, and validation procedures. The full clinical scenario framework was presented in detail to ensure a consistent understanding of context and expectations. Additional specialty-specific instructions were provided where relevant, including requirements regarding the number and types of documents per patient and scenario-dependent constraints. To preserve ecological validity and avoid burdening contributors with rigid formatting that would diverge from real-world clinical practice, reports were produced as guided but free-text documents within an online text-editing environment. Human validators were responsible for content quality control, assessing clinical coherence and adherence to the instructions. This step resulted in a rejection rate of 3.6% of submitted documents. Beyond human review, an automated "sanity check" pipeline was implemented to verify compliance with key structural and procedural requirements. These checks included validating the document type and the expected number of documents per patient, confirming final hospitalization duration entries, and recording required clinical procedures or acts.
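Structural checks of this kind might look like the following sketch. The document-type names and field keys here are hypothetical stand-ins for illustration; the actual pipeline is in the published cleaning-and-publication repository.

```python
# Illustrative allowed document types; the corpus's actual type labels
# (in French) are defined by the released schema.
ALLOWED_TYPES = {"discharge summary", "operative report",
                 "pre-operative consultation", "pathology report"}

def sanity_check(patient):
    """Return a list of structural problems found in one patient record.

    Field keys ("documents", "type", "length_of_stay") are hypothetical;
    the checks mirror those described in the text: valid document types,
    expected document counts, and a completed hospitalization duration.
    """
    errors = []
    docs = patient.get("documents", [])
    if not docs:
        errors.append("no documents")
    for doc in docs:
        if doc.get("type") not in ALLOWED_TYPES:
            errors.append(f"unexpected document type: {doc.get('type')}")
    if patient.get("length_of_stay") is None:
        errors.append("missing length-of-stay entry")
    return errors
```

Running such checks corpus-wide yields a per-record error report, from which non-conforming records can be routed to manual correction.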
This automated stage ensured the consistency of core structured elements across the corpus. Following automated validation, 120 documents required manual correction. These interventions addressed minor typographical errors within structured fields, format-style deviations that interfered with automated parsing, or unauthorized modifications to the original instruction template. No corrections altered the underlying clinical content; all changes were limited to restoring technical conformity with the dataset specifications.

6 Data Availability

The PARHAF corpus is openly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and the Etalab 2.0 license. It was released on March 25, 2026. The primary distribution is on Hugging Face at https://huggingface.co/datasets/HealthDataHub/PARHAF.

7 Code Availability

The code used for data processing and quality control is publicly available on GitHub at https://github.com/xtannier/PAHRAF_cleaning_and_publication. This includes scripts for converting source documents from .docx to plain text, generating JSON metadata, building the Hugging Face Parquet dataset, and running the automated sanity-check pipeline described above.

References

[1] M. Becker and B. Böckmann (2016) Extraction of UMLS concepts using Apache cTAKES for German language. In Health Inform. Meets eHealth, G. Schreier, E. Ammenwerth, A. Hörbst, and D. Hayn (Eds.), Studies in Health Technology and Informatics, Vol. 223, pp. 71–76.
[2] G. Berthelier, A. Boutet, and A. Richard (2023) Toward training NLP models to take into account privacy leakages. In 2023 IEEE Int. Conf. Big Data (BigData), pp. 4854–4862.
[3] L. Campillos, L. Deléger, C. Grouin, T. Hamon, A. Ligozat, and A. Névéol (2018) A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT). Lang. Resour. Eval. 52 (2), pp.
571–601.
[4] R. Chevrier, V. Foufi, C. Gaudet-Blavignac, A. Robert, and C. Lovis (2019) Use and understanding of anonymization and de-identification in the biomedical literature: scoping review. J. Med. Internet Res. 21 (5), pp. e13484.
[5] F. Dernoncourt, J. Y. Lee, O. Uzuner, and P. Szolovits (2016) De-identification of patient notes with recurrent neural networks. J. Am. Med. Inform. Assoc.
[6] F. Estignard, S. Ghannay, J. Girard-Satabin, N. Hiebel, and A. Névéol (2025) Evaluating the confidentiality of synthetic clinical texts generated by language models. In Int. Conf. Artif. Intell. Med. (AIME), pp. 130–139.
[7] J. Frei, L. Frei-Stuber, and F. Kramer (2023) GERNERMED++: semantic annotation in German medical NLP through transfer-learning, translation and word alignment. J. Biomed. Inform. 147, pp. 104513.
[8] J. Frei and F. Kramer (2023) Annotated dataset creation through large language models for non-English medical NLP. J. Biomed. Inform. 145, pp. 104478.
[9] Y. Gao, D. Dligach, L. Christensen, S. Tesch, R. Laffin, D. Xu, T. Miller, O. Uzuner, M. M. Churpek, and M. Afshar (2022) A scoping review of publicly available language tasks in clinical natural language processing. J. Am. Med. Inform. Assoc. 29 (10), pp. 1797–1806.
[10] N. Grabar, V. Claveau, and C. Dalloux (2018) CAS: French corpus with clinical cases. In Proc. 9th Int. Workshop Health Text Mining Inf. Anal., pp. 122–128.
[11] C. Grouin, A. Rosier, O. Dameron, and P. Zweigenbaum (2009) Testing tactics to localize de-identification. In Proc. MIE 2009, K. Adlassnig, J. Mantas, and B. Blobel (Eds.), Studies in Health Technology and Informatics, Vol. 150, Amsterdam, pp. 735–739.
[12] U.
Hahn (2025) Clinical document corpora-real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data. JAMIA Open 8 (3), pp. ooaf024.
[13] N. Hiebel, O. Ferret, K. Fort, and A. Névéol (2023) Can synthetic text help clinical named entity recognition? A study of electronic health records in French. In Proc. 17th Conf. Eur. Chapter Assoc. Comput. Linguist. (EACL), pp. 2320–2338.
[14] J. Ive, N. Viani, J. Kam, L. Yin, S. Verma, S. Puntis, R. N. Cardinal, A. Roberts, R. Stewart, and S. Velupillai (2020) Generation and evaluation of artificial mental health records for natural language processing. NPJ Digit. Med. 3 (1), pp. 69.
[15] A. Jannot, E. Zapletal, P. Avillach, M. Mamzer, A. Burgun, and P. Degoulet (2017) The Georges Pompidou University hospital clinical data warehouse: a 8-years follow-up experience. Int. J. Med. Inform. 102, pp. 21–28.
[16] A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Sci. Data 3, pp. 160035.
[17] E. Laparra, A. Mascio, S. Velupillai, and T. Miller (2021) A review of recent work in transfer learning and domain adaptation for natural language processing of electronic health records. Yearb. Med. Inform. 30 (1), pp. 239–244.
[18] B. Le Guellec, K. Adambounou, L. C. Adams, T. Agripnidis, S. S. Ahn, R. Ait Chalal, T. Akinci D'Antonoli, P. Amouyel, H. Andersson, R. Bentegeac, C. Benzoni, A. A. Blandino, F. Busch, E. Can, R. Cau, A. U. Cavallo, C. Chavihot, E. Chiquete, R. Cuocolo, E. Divjak, B. Dziadkowiec-Macek, A. Elogne, S. C. Fanni, C. Ferrarotti, C. Fossataro, F. Fossataro, K. Fułek, M. Fułek, P. Gać, M. Gachowska, I. García-Juárez, M. Gatti, N. Gorelik, A. M.
Goulianou, A. Hamroun, N. Herinirina, Q. Holay, G. Ivanac, F. Kitamura, M. E. Klontzas, A. Kompanowska, R. Kompanowski, K. Kraik, D. Krupka, A. Lefèvre, T. Lemke, M. Lindholz, P. Macek, M. Makowski, L. Mannacio, A. Meddeb, L. Müller, A. Natale, B. Nguema Edzang, A. Ojeda, Y. W. Park, F. Piccione, A. Ponsiglione, M. Poręba, R. Poręba, P. Prucker, J. Pruvo, R. alba Pugliesi, F. H. Rabemanorintsoa, V. Rafailidis, K. Resler, J. Rotkegel, L. Saba, E. Siebert, A. Stanzione, A. F. Tekin, L. Toapanta-Yanchapaxi, M. Triantafyllou, E. Tsaoulia, S. Urban, E. Vassalou, F. Vernuccio, W. Wang, J. Wassélius, A. Włodarczak, S. Włodarczak, A. Wysocki, L. Xu, T. Zatoński, S. Zhang, S. Ziegelmayer, G. Kuchcinski, and K. K. Bressem (2026) PARROT, an open multilingual radiology reports dataset. Eur. J. Radiol. Artif. Intell. 5, pp. 100066.
[19] C. Lohr, S. Buechel, and U. Hahn (2018) Sharing copies of synthetic clinical corpora without physical distribution - A case study to get around IPRs and privacy constraints featuring the German JSYNCC corpus. In Proc. LREC 2018, N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, and T. Tokunaga (Eds.).
[20] B. Magnini, B. Altuna, A. Lavelli, M. Speranza, and R. Zanoli (2020) The E3C project: collection and annotation of a multilingual corpus of clinical cases. In Proc. of the Seventh Italian Conf. on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1-3, 2021, J. Monti, F. Dell'Orletta, and F. Tamburini (Eds.), CEUR Workshop Proceedings, Vol. 2769.
[21] O. Melamud and C. Shivade (2019) Towards automatic generation of shareable synthetic clinical notes using neural language models. In Proc. 2nd Clinical NLP Workshop, pp. 35–45.
[22] A. Miranda-Escalada, E. Farré, and M.
Krallinger (2020) Named entity recognition, concept normalization and clinical coding: overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In Proc. IberLEF 2020, M. Á. G. Cumbreras, J. Gonzalo, E. M. Cámara, R. Martínez-Unanue, P. Rosso, S. M. Jiménez-Zafra, J. A. O. Zambrano, A. Miranda, J. Porta-Zamorano, Y. Gutiérrez, A. Rosá, M. Montes-y-Gómez, and M. G. Vega (Eds.), CEUR Workshop Proceedings, Vol. 2664, pp. 303–323.
[23] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, and M. Krallinger (2022) Overview of DisTEMIST at BioASQ: automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources. In Proc. Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022, G. Faggioli, N. Ferro, A. Hanbury, and M. Potthast (Eds.), CEUR Workshop Proceedings, Vol. 3180, pp. 179–203.
[24] L. Modersohn, S. Schulz, C. Lohr, and U. Hahn (2022) GRASCCO - the first publicly shareable, multiply-alienated German clinical text corpus. In Ger. Med. Data Sci. 2022, R. Röhrig, N. Grabe, V. S. Hoffmann, U. Hübner, J. König, U. Sax, B. Schreiweis, and M. Sedlmayr (Eds.), Studies in Health Technology and Informatics, Vol. 296, pp. 66–72.
[25] N. Moore, P. Blin, R. Lassalle, N. Thurin, P. Bosco-Levy, and C. Droz (2021) National health insurance claims database in France (SNIRAM), système national des données de santé (SNDS) and Health Data Hub (HDH). In Databases for pharmacoepidemiological research, pp. 131–140.
[26] I. Neamatullah, M. M. Douglass, L. H. Lehman, A. Reisner, M. Villarroel, W. J. Long, P. Szolovits, G. B. Moody, R. G. Mark, and G. D.
Clifford (2008) Automated de-identification of free-text medical records. BMC Med. Inform. Decis. Mak. 8.
[27] A. Névéol, H. Dalianis, S. Velupillai, G. Savova, and P. Zweigenbaum (2018) Clinical natural language processing in languages other than English: opportunities and challenges. J. Biomed. Semant. 9 (1), pp. 12.
[28] M. Neves, A. J. Yepes, A. Siu, R. Roller, P. Thomas, M. V. Navarro, L. Yeganova, D. Wiemann, G. M. Di Nunzio, F. Vezzani, et al. (2022) Findings of the WMT 2022 biomedical translation shared task: monolingual clinical case reports. In Proc. 7th Conf. Mach. Transl. (WMT), pp. 694–723.
[29] J. P. Pestian, C. Brew, P. Matykiewicz, D. Hovermale, N. Johnson, K. B. Cohen, and W. Duch (2007) A shared task involving multi-label classification of clinical free text. In Biol. Transl. Clin. Lang. Process., K. B. Cohen, D. Demner-Fushman, C. Friedman, L. Hirschman, and J. Pestian (Eds.), Prague, Czech Republic, pp. 97–104.
[30] M. Saeed, M. Villarroel, A. T. Reisner, G. Clifford, L. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark (2011) Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Crit. Care Med. 39, pp. 952–960.
[31] J. Sinclair (2004) Corpus and text: basic principles. In Developing Linguistic Corpora: a Guide to Good Practice, M. Wynne (Ed.).
[32] L. Sweeney (2002) K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10 (5), pp. 557–570.
[33] X. Tannier, P. Wajsbürt, A. Calliger, B. Dura, A. Mouchet, M. Hilka, and R.
8 Author Contributions
Xavier Tannier: Conceptualization, Methodology, Software, Validation, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Funding acquisition. Salam Abbara: Conceptualization, Methodology, Writing - Review & Editing, Funding acquisition. Rémi Flicoteaux: Conceptualization, Methodology, Validation, Writing - Review & Editing. Youness Khalil: Methodology, Writing - Review & Editing, Project administration. Aurélie Névéol: Conceptualization, Writing - Review & Editing, Funding acquisition. Pierre Zweigenbaum: Conceptualization, Writing - Review & Editing, Funding acquisition. Emmanuel Bacry: Conceptualization, Supervision, Project coordination, Funding acquisition, Writing - Review & Editing.
9 Competing Interests
The authors declare no competing interests related to this work.
10 Acknowledgements
We thank the authors of the reports for their contribution and feedback on the protocol, as well as the PARTAGES consortium members for fruitful discussions towards corpus development. We also thank Florian Pons for helping with operational support and project coordination.
11 Funding
This work was carried out as part of the PARTAGES project, awardee of the Bpifrance France 2030 call for proposals "Digital Commons for Generative Artificial Intelligence."
12 Ethics statement
Through their affiliations with French public service agencies, the developers of the PARHAF corpus have benefited from access to SNDS data. The clinical scenarios used to write the clinical documents in the PARHAF corpus are based on aggregated public health statistics and do not pertain to identifiable real patients. No private information about individual subjects was used in this study; therefore, no IRB or ethics approval was required to create or distribute the PARHAF corpus. The clinical document authors involved in this study were apprised of the full document creation protocol. Participation was voluntary, and authors were compensated for their work in accordance with French labor laws. The PARHAF corpus is intended for use as educational material and as support for the development and evaluation of clinical NLP systems. It is not intended for clinical use.
Supplementary Material
Figure S1 shows the changes in distribution for each specialty, after the adjustment described in Section 2.3.1. Figure S2 illustrates the number of patient records written by each author.
|
Scooped by
Gilbert C FAURE
March 29, 3:50 AM
|
98% of people don’t fail with AI…
They’re just stuck on beginner tools forever.
And then wonder why results stay average.
When ChatGPT and other AI tools launched towards the end of 2022, I was doing everything the “easy way”:
→ I used basic prompts
→ Used generic tools
→ Got average results
But it felt productive.
But now I do 10x more work… in half the time. With a better tool stack.
That’s when it clicked: AI rewards leverage.
And the gap between novice and expert level AI tools is where that leverage lives.
Here’s what changed everything for me:
• Beginners use AI to assist • Experts use AI to do the work
• Beginners ask questions • Experts build workflows and agents
• Beginners chase outputs • Experts chase systems
Now look at how the shift actually plays out:
→ Presentations: Gamma creates beautiful presentations → Data: Rows for structured analysis → Research: ChatGPT Deep Research for depth → Learning: NotebookLM for different learning formats → Video: VEED for end-to-end video content editing → Coding: Claude Code for precision → Websites: Webflow for aesthetic websites → Apps: Replit to build anything
The difference is the level of thinking behind the tool.
Do’s for expert-level AI usage: ✔️ Build repeatable workflows ✔️ Combine multiple tools into one pipeline ✔️ Use AI for execution ✔️ Validate outputs with real-world data ✔️ Continuously upgrade your stack
Don’ts for expert-level AI usage: ✖️ Don’t rely on a single tool for everything ✖️ Don’t skip context ✖️ Don’t blindly trust outputs ✖️ Don’t ignore speed vs quality tradeoffs ✖️ Don’t stay comfortable with beginner setups
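The “combine multiple tools into one pipeline” advice can be made concrete. Here is a minimal, hypothetical Python sketch of a repeatable workflow: `research`, `draft`, and `validate` are placeholder functions standing in for real tool calls (deep research, drafting, output checks), not any actual vendor API.

```python
# Hypothetical sketch: chaining several "tools" into one repeatable
# pipeline, with a validation step before anything ships.

def research(topic):
    # Placeholder for a deep-research step (e.g. a research tool call).
    return f"notes on {topic}"

def draft(notes):
    # Placeholder for a drafting step (e.g. a writing tool call).
    return f"draft based on {notes}"

def validate(output, must_contain):
    # "Validate outputs with real-world data": here, a simple sanity
    # check that the draft still covers the requested topic.
    if must_contain not in output:
        raise ValueError("output failed validation")
    return output

def pipeline(topic):
    # Combine the tools into one workflow you can rerun on any topic.
    return validate(draft(research(topic)), must_contain=topic)

print(pipeline("clinical NLP"))
```

The point is not the stub functions but the shape: each step’s output feeds the next, and the validation gate makes the workflow repeatable rather than a one-off chat.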
Most people get stuck because beginner tools feel “good enough”. But “good enough” is the enemy of scale.
If you want an unfair advantage, you need better tools, used like an expert.
Check the infographic 👇 I’ve covered all the expert-level tools clearly.
And now I’m curious: What do you think about using expert-level AI tools? Are they overkill or the real edge? Comment below 👇
➕ Follow for more breakdowns like this.
♻️ Repost to help your network use expert-level AI tools.
|