Exploring AIChE 2015 Technical Program (PD2M topic)

AIChE (American Institute of Chemical Engineers) holds an annual meeting in November. It has several topical sessions. I wanted to explore the topic session called PD2M (Pharmaceutical Discovery, Development and Manufacturing Forum).

I wanted to answer the following questions:

Disclaimer: This is meant to be a fun exercise. It is exploratory rather than comprehensive or rigorously accurate, so I would suggest not over-interpreting anything here. Take it all with a grain of salt.

Exploring the data

The PD2M topic has ~30 sessions, ~190 talks in which ~600 authors are presenting and ~115 organizations (industry and academia) are represented. Among organizations, the split between academia and industry is close to 50/50.

The graph below lists the folks giving at least 4 talks and the number of talks they are giving.

The graph below lists organizations with at least 5 talks and the number of talks from each organization.

About 13% of talks have joint authors from industry and academia, 47% have industry-only authors, and 40% have academia-only authors.
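As a rough illustration of how such a split can be computed, here is a minimal Python sketch. It is not the code used for this post (that analysis was done in R). The toy data, the `classify_talk` and `split_percentages` names, and the assumption that every author is already tagged as either "industry" or "academia" are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each talk is the list of its authors' affiliation types.
# Assumes every talk has at least one author and every author is already tagged.
talks = [
    ["industry", "industry"],
    ["academia"],
    ["industry", "academia"],
    ["academia", "academia"],
]

def classify_talk(affiliation_types):
    """Label a talk by the kinds of organizations its authors come from."""
    kinds = set(affiliation_types)
    if kinds == {"industry"}:
        return "industry only"
    if kinds == {"academia"}:
        return "academia only"
    # Anything else is treated as a joint talk.
    return "joint"

def split_percentages(talks):
    """Return the percentage of talks in each category."""
    counts = Counter(classify_talk(t) for t in talks)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}

shares = split_percentages(talks)
```

In practice the hard part is the tagging itself (deciding whether each organization counts as industry or academia), which this sketch takes as given.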

The graph below shows this split for each session.

The list of talks with both academia and industry authors is here. Some of the sessions that are heavy on academic talks are:

For the sessions on application of continuous manufacturing in drug product, session I has more industry talks and session II has more talks from academia.

Themes/Topics across talks

The sessions already represent a categorization of talks. But there are also several text mining algorithms that try to detect the themes/topics underlying a set of documents. One of them is topic models. I wanted to take the abstracts of talks as inputs and use topic models to determine the underlying topics. We have to specify the number of topics that we want the algorithm to find. The output from the algorithm consists of the likely words for each topic and the likelihood of each document containing each topic. We need to assign labels to each topic based on the likely words occurring in it, which is subjective. In reality, more than one topic could be covered in one document, but I used the simplistic approach of taking the topic with the highest probability as the topic that an abstract represents.

In this case, I used the topicmodels R package with the abstracts of the talks. I chose 10 topics and used the functions with their default values (I didn’t tune any parameters). The figure below shows the top likely words in each of the 10 topics.
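The "assign the topic with the highest probability" step can be sketched in a few lines. This is an illustrative Python version, not the R code used for this post. The probabilities below are made up, and `doc_topic_probs` and `assign_dominant_topic` are hypothetical names. It only shows the assignment step, not the fitting of the topic model.

```python
# Hypothetical per-abstract topic probabilities, as a fitted topic model
# might produce them (each row sums to 1). Three topics here for brevity.
doc_topic_probs = {
    "talk_a": [0.70, 0.20, 0.10],
    "talk_b": [0.05, 0.15, 0.80],
    "talk_c": [0.40, 0.45, 0.15],
}

def assign_dominant_topic(probs):
    """Return the index of the most probable topic for one document."""
    return max(range(len(probs)), key=lambda k: probs[k])

# Force each abstract into exactly one topic.
assignments = {doc: assign_dominant_topic(p) for doc, p in doc_topic_probs.items()}
```

Note how "talk_c" ends up in topic 1 even though topic 0 is almost as likely. That is the kind of information this simplification throws away.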

Based on the terms, I tentatively assigned the following topic names to each topic:

The number of talks in each topic is given in the figure below. The talks assigned to each topic can be accessed by clicking the links above for each topic. This is a good place to add some caveats. As you look at specific talks, there will be several instances where a talk doesn’t seem to fit its topic. There are several reasons for that (forcing each document into one topic, working with default parameters to determine topics, preprocessing prior to running the model). But even this simple version seems to capture the topics qualitatively, if not exactly for each talk. For the rest of this discussion, I am going to work with these topic assignments in spite of the caveats above.

The split of talks in each session across the 10 topics is shown in the heat map below (rows sum to 1; more blue implies a higher % of session talks in that topic).
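The numbers behind such a heat map are row-normalized counts. Here is a minimal Python sketch of that step, assuming one (session, topic) pair per talk. The records and the `session_topic_shares` name are hypothetical and this is not the code behind the figure.

```python
from collections import Counter, defaultdict

# Hypothetical (session, assigned topic) pairs, one per talk.
talk_records = [
    ("Session A", "modeling"),
    ("Session A", "modeling"),
    ("Session A", "crystallization"),
    ("Session B", "continuous"),
]

def session_topic_shares(records):
    """Count talks per (session, topic), then divide each row by its total."""
    counts = defaultdict(Counter)
    for session, topic in records:
        counts[session][topic] += 1
    shares = {}
    for session, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[session] = {t: n / total for t, n in topic_counts.items()}
    return shares

shares = session_topic_shares(talk_records)
```

Because each row is divided by its own total, sessions with very few talks can look strongly concentrated in one topic, so the colors are best read together with the talk counts.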

A few interesting observations are:

  • The session on product performance models is very heavy on modeling
  • The continuous, crystallization, and bio categories seem to correlate reasonably well with the sessions on continuous processing, crystallization, and bio, respectively.

The next figure shows the split of talks across topics between industry and academia.

Industry has a high focus on process dev/modeling (that’s the bread and butter), but academia doesn’t put much emphasis on that topic. Academia has a higher percentage of talks than industry in the areas of crystallization, HME, and continuous.

The next figure shows the split of talks across topics for organizations that have several talks this year.

The academia results are driven by Rutgers University, RCPE GmbH, Graz University, and New Jersey Institute of Technology. HME talks are by RCPE GmbH and Graz University. Crystallization talks are mainly from Purdue University. Dissolution/Enabled formulation talks come from Dow and New Jersey Institute of Technology.

Final Thoughts

I don’t know if you learned something you didn’t already know. I found it useful to see this summary, and it was a fun learning exercise. I hope you had some fun reading it too. For anybody wanting to reproduce or improve on this, all the code details and links to the code are here.